Data can also be exchanged dynamically with other systems that use XML, for example Office, Oracle, Lotus Notes, browsers, HTTP servers, and so on.
The markup language for mathematics - MathML - will be important in the research end of statistics, and also to some extent in applied data analysis for specifying models, etc. See the Math ML Example; the DTD can be fetched from mmlents.zip.
Scalable Vector Graphics (SVG) is an XML-based format for specifying graphics descriptions that can be scaled easily without distortion. We may be using it (or an extension of it) in Omegahat to represent plots. The DTD is available from here.
Since XML is similar to HTML, we can encourage people to use this type of format for different inputs. We have used it effectively for defining options with potentially more complicated structures than simple name-value pairs. Hierarchical structures are easily handled by XML. Plots can be described in this way too, and indeed we intend to do this in Omegahat.
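As a hypothetical illustration (the element and attribute names here are invented, not taken from any existing DTD), such an options document might look like:

<!-- illustrative options specification; names are invented -->
<options>
  <graphics>
    <device width="400" height="400" pointsize="12"/>
    <colors background="white" foreground="black"/>
  </graphics>
</options>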
This XML approach contrasts with a simple ASCII or native object dump, which relies on the receiving system or user to understand the format. (Communicating via the S4 object ASCII dump format was used effectively to communicate between Java and S4, but was heavily dependent on the parsing facilities being migrated to Java and to any other system engaging in such communication.)
In contrast with the embedded Java facilities and the CORBA packages for R and S, XML is a static representation of data rather than a live object offering methods.
In addition to providing an environment-neutral form of persistence, XML can be used for configuration files, plot template descriptions, documentation, etc.
The aim of providing facilities in R and S for reading XML at the user level is to encourage users to consider the development of DTDs for statistical data and concepts. If we can "standardize" on some basic descriptions, we can exchange data easily between numerous systems, including spreadsheets, etc. These DTDs, coupled with Java interface classes and IDL modules, create an integrated framework for open network computing in multiple ways and at multiple user levels. We strongly encourage people to actively make their DTDs available to others.
In the future, we will develop facilities for writing objects from R, S and Omegahat in XML format using the DTDs we develop. A general mechanism for validated output filters can be created. See Writing XML.
Rather than offering only one of these parsing styles, we provide functions that work both ways for R. In S, we currently support only the document/tree-based approach. xmlTreeParse() is the tree-based version; it generates an internal tree and then converts it to a list of lists in R/S. This uses the libxml library from Daniel Veillard of W3.org.
The second function, xmlEventParse(), is event driven. The user specifies a collection of R/S-level functions, in addition to the file name, and the parser invokes the appropriate function as new XML elements are encountered. The C-level parser we use is expat, developed by James Clark.
Unless you have very large XML documents, if you want to experiment with just one parser, use the first of these, i.e. the document-based one. It is the simplest to use, at the cost of control over the creation of the data structures and of potential memory growth.
In R, the collection of functions is usually a closure, and it can manipulate local data. In S, these are usually a list of functions. In order to handle mutable state, one should use the interface driver mechanism.
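As a minimal sketch of the closure style (the handler name and the file are invented for illustration), here is a handler that counts the elements in a document by accumulating local state:

elementCounter <- function() {
  count <- 0                          # local state, updated via <<-
  startElement <- function(name, atts) {
    count <<- count + 1
  }
  list(startElement = startElement,
       count = function() count)
}

h <- elementCounter()
# xmlEventParse("data/test.xml", h)   # afterwards, h$count() gives the tally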
The closure approach is described in more detail in Docs/Outline.nw and in the R documentation in man/.
> x <- xmlTreeParse("data/test.xml")
> names(x)
[1] "file"     "version"  "children"
> x$file
[1] "data/test.xml"
> names(x$children[[1]])
[1] "name"       "attributes" "children"   "value"
Now we turn our attention to manipulating the previously generated tree. We can do this in R/S using the following version of treeApply.
treeApply <- function(x, func, post=NULL, pre=NULL, ...)
{
  ans <- NULL
  value <- func(x)
  if(length(value))
    ans <- list(value=value)

  # If there are any children, do a recursive apply on those also,
  # keeping the result if it is non-null. Note that post and pre
  # must be passed along explicitly so they apply at every level.
  if(length(x[["children"]]) > 0) {
    tmp <- lapply(x[["children"]], treeApply, func, post=post, pre=pre, ...)
    if(length(tmp) > 0)
      ans$children <- tmp
  }

  # invoke the post-processing of children hook.
  if(length(post)) {
    post(x)
  }

  invisible(ans)
}

Armed with this version of apply(), we can start doing some processing of the tree. First, let's display the type of each node in the tree.
v <- treeApply(x, function(x) cat(class(x),"\n"))
named
XMLComment
XMLNode
XMLNode
XMLNode
XMLEntityRef
XMLProccesingInstruction
XMLNode
XMLNode
XMLNode

A slightly more interesting example is to produce a graphical display of the tree. I use PSTricks for this purpose. We define a node function that produces the relevant TeX commands and also a post function to tidy up the groups.
foo <- function(x) {
  label <- ifelse(length(x$value), x$value,
                  ifelse(length(x$name), x$name, "doc"))
  if(length(x[["children"]]) == 0) {
    cat("\\Tr{", label, "}%\n", sep="")
  } else {
    cat("\\pstree{\\Tr{", label, "}}{%\n", sep="")
  }
}

post <- function(x) if(length(x$children) > 0) cat("}\n")

treeApply(x, foo, post=post)

The result is
\pstree{\Tr{doc}}{%
  \Tr{ A comment }%
  \pstree{\Tr{foo}}{%
    \Tr{element}%
    \Tr{ }%
    \Tr{test entity}%
    \Tr{print "This is some more PHP code being executed."; }%
    \pstree{\Tr{duncan}}{%
      \pstree{\Tr{temple}}{%
        \Tr{extEnt;}%
      }
    }
  }
}
Note that the post function is more naturally done in an event-driven parser, via the endElement handler.
As another example, this document itself has been carefully constructed to be parseable by the xmlTreeParse function.
v <- xmlTreeParse("index.html")
A simple example is one in which we gather all the character text in the document. In other words, we throw away the XML hierarchical structure and any nodes that are not simply character text.
characterOnlyHandler <- function() {
  txt <- NULL
  text <- function(val, ...) {
    txt <<- c(txt, val)
  }
  getText <- function() {
    txt
  }
  return(list(text=text, getText=getText))
}

z <- xmlEventParse("data/job.xml", characterOnlyHandler())
z$getText()
 [1] " "
 [2] " "
 [3] " "
 [4] " "
 [5] "GBackup"
 [6] " "
 [7] "Development"
 [8] " "
 [9] " "
[10] "Open"
[11] " "
[12] "Mon, 07 Jun 1999 20:27:45 -0400 MET DST"
[13] " "
[14] "USD 0.00"
[15] " "
[16] " "
[17] " "
[18] " "
[19] " "
[20] " "
[21] " "
[22] "Nathan Clemons"
[23] " "
[24] "nathan@windsofstorm.net"
[25] " "
[26] " "
[27] " "
[28] " "
[29] " "
[30] " "
[31] " "
[32] " "
[33] " "
[34] " "
[35] " "
[36] " "
[37] " The program should be released as free software, under the GPL."
[38] " "
[39] " "
[40] " "

Note that we can discard the lines that are simply white space using the trim argument. This trims all text values; more granularity is needed here.
> z <- xmlEventParse("data/job.xml", characterOnlyHandler(), ignoreBlanks=T, trim=T)
> z$getText()
[1] "GBackup"
[2] "Development"
[3] "Open"
[4] "Mon, 07 Jun 1999 20:27:45 -0400 MET DST"
[5] "USD 0.00"
[6] "Nathan Clemons"
[7] "nathan@windsofstorm.net"
[8] "The program should be released as free software, under the GPL."
Much as we did with the tree-based parser, we can construct a display of the structure of the document using the event-driven parser.
xmlEventParse("data/job.xml",
              list(startElement = function(x, ...) {
                     cat("\\pstree{\\Tr{", x[[1]], "}}{%\n", sep="")
                   },
                   endElement = function(x, ...) cat("}\n")))

Note that we use a list of functions rather than a closure in this example, because we do not have data that persists across function calls.
Parsing the mtcars.xml file (or, generally, files using the DTD used by that file) can be done via the event parser in the following manner. First we define a closure with methods for handling the different tags of interest. Rather than using startElement and looking at the name of the tag/element, we will instruct xmlEventParse() to look for a method whose name is the same as the tag before defaulting to the startElement() method. As with most event-driven code, the logic is different and may seem complicated. The idea is that we will see the dataset tag first, so we define a function with this name. The dataset tag will have attributes that we store in order to attach them to the data frame that we construct from reading the entire XML structure. Of special interest in this list is the number of records. We store this separately, converting it to an integer, so that when we find the number of variables, we can allocate the array.
Next, we define a method for handling the variables element; there we find the number of variables. Note that if the DTD didn't provide this count, we could defer the computation of the number of variables and the allocation of the array until we saw the end of the variables tag. This would allow the user to avoid having to specify the number of variables explicitly; a sketch of this alternative follows.
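A sketch of that deferred allocation (hypothetical; it uses the numRecords and varNames state variables of the handler() closure defined below, and the rest of the logic would have to change accordingly):

endElement <- function(x, ...) {
  if(x == "variables") {
    # all the variable names have now been collected in varNames,
    # so we can size the matrix without needing a count attribute.
    data <<- matrix(0., numRecords, length(varNames),
                    dimnames = list(NULL, varNames))
  }
}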
As we encounter each variable element, we expect the next text element to be the name of the variable. So, within the variable() method, we set the flag expectingVariableName to true. Then, in the text() function, we interpret the value either as a variable name, if expectingVariableName is true, or as the value of a record if not. In the former case, we append the value to the list of variable names in varNames. We need to set expectingVariableName to false when we have enough names; we do this when the length of varNames equals the number of columns in data, computed from the count attribute.
A different way to do this is to have an endElement() function which sets expectingVariableName to false when the element being ended is variables; a minimal version is sketched below. Again, this is a choice, and different implementations will have advantages with respect to robustness, error handling, etc.
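That alternative might look like the following (a sketch only; in the handler() given below, endElement() is instead used to handle the end of the dataset tag):

endElement <- function(x, ...) {
  # stop treating text as variable names once </variables> is seen.
  if(x == "variables")
    expectingVariableName <<- F
}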
The text() function handles the case where we are not expecting the name of a variable, and instead interprets the string as the value of a record. To do this, we have to convert the collection of numbers separated by white space to a numeric vector. We do this by splitting the string on white space and then converting each entry to a numeric value. We assign the resulting numeric vector to the matrix data in the current row. The index of the record is stored in currentRecord. This is incremented by the record method. (We could do this in text() also, but this is more interesting.)
We will ignore issues where the values are separated across lines, contain strings, etc. The latter is orthogonal to the event-driven XML parsing. The former (a partial record per line) can be handled by computing the number of values seen so far for this record, storing this across calls to text(), and adding to the appropriate columns; a sketch follows.
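A sketch of that partial-record handling (hypothetical; numSeen is an extra state variable that the record() method would reset to zero for each new record):

numSeen <- 0
text <- function(x, ...) {
  # add however many values this chunk of text contains to the
  # current record, starting after the numSeen values already stored.
  vals <- as.numeric(strsplit(x, "[ \t]+")[[1]])
  vals <- vals[!is.na(vals)]          # drop empty leading fields
  if(length(vals) > 0) {
    data[currentRecord, numSeen + seq(along = vals)] <<- vals
    numSeen <<- numSeen + length(vals)
  }
}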
handler <- function() {
  # Private or local variables used to store information across
  # method calls from the event parser.
  data <- NULL
  numRecords <- 0
  varNames <- NULL
  meta <- NULL
  currentRecord <- 0
  expectingVariableName <- F
  rowNames <- NULL

  # read the attributes from the dataset element.
  dataset <- function(x, atts) {
    numRecords <<- as.integer(atts[["numRecords"]])
    # store these so that we can put them as attributes
    # on data when we create it.
    meta <<- atts
  }

  variables <- function(x, atts) {
    # From the DTD, we expect a count attribute telling us the number
    # of variables.
    data <<- matrix(0., numRecords, as.integer(atts[["count"]]))
    # set the XML attributes from the dataset element as R
    # attributes of the data.
    attributes(data) <<- c(attributes(data), meta)
  }

  # when we see the start of a variable tag, we are expecting
  # its name next, so handle text accordingly.
  variable <- function(x, ...) {
    expectingVariableName <<- T
  }

  record <- function(x, atts) {
    # advance the current record index.
    currentRecord <<- currentRecord + 1
    rowNames <<- c(rowNames, atts[["id"]])
  }

  text <- function(x, ...) {
    if(x == "")
      return(NULL)

    if(expectingVariableName) {
      varNames <<- c(varNames, x)
      if(length(varNames) >= ncol(data)) {
        expectingVariableName <<- F
        dimnames(data) <<- list(NULL, varNames)
      }
    } else {
      # split on runs of white space ("+", not "*", which would also
      # match the empty string between every pair of characters).
      e <- gsub("[ \t]+", ",", x)
      vals <- sapply(strsplit(e, ",")[[1]], as.numeric)
      vals <- vals[!is.na(vals)]      # discard empty leading fields
      data[currentRecord, ] <<- vals
    }
  }

  # Called at the end of each tag.
  endElement <- function(x, ...) {
    if(x == "dataset") {
      # set the row names for the matrix.
      dimnames(data)[[1]] <<- rowNames
    }
  }

  return(list(variable = variable,
              variables = variables,
              dataset = dataset,
              text = text,
              record = record,
              endElement = endElement,
              data = function() data,
              rowNames = function() rowNames))
}

A more robust version of this, which handles row names and produces a data frame rather than a matrix, is given in the function dataFrameEvents in library(XML).
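A usage sketch (hypothetical call; as in the earlier examples, we assume the file lives in data/ and that the parser dispatches on tag names as described above):

h <- handler()
xmlEventParse("data/mtcars.xml", h)
d <- h$data()     # the populated matrix, with dimnames and meta attributes
dim(d)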
This software is known to run on both Linux (RedHat 6.1) and Solaris (2.6).
To run the R functions, you will need to install either or both of the following libraries (libxml and/or expat). See Installing 3rd party software. The goal is to share this code with an S4/Splus5 version. In order to keep the programming interfaces consistent, we would appreciate being notified of changes.
Having decided to use libxml and/or expat, you must specify their locations. Edit the GNUmakefile and uncomment the lines defining LIBXML and/or LIBEXPAT as appropriate. Change the value on the right-hand side of the = sign to the location of the corresponding directory.
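For example, the edited lines might look like this (the paths are illustrative, matching the command-line example below):

# in GNUmakefile; uncomment and adjust to your installation
LIBXML=$(HOME)/libxml-1.7.3
LIBEXPAT=$(HOME)/expat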
Next, you need to specify whether you are building for R or S4/Splus5. You can do this via the variable LANGUAGE in the GNUmakefile. It defaults to R.
All of these can be specified on the command line, for example:
cd XML
make LIBXML=$HOME/libxml-1.7.3 LIBEXPAT=$HOME/expat LANGUAGE=R CC=gcc
I have installed this using the makefiles here and the GNUmakefile.admin in the Omegahat source tree version of this. That, however, relies on some other makefiles in the R section of the Omegahat tree. If anyone else wishes to package this, please send me the changes so that I can make them available to others. Of course, you can use it by just attaching the chapter and using dyn.load().
Some of this would be easier if we used either the R or S4/Splus5
package installation facilities. However, I do not have time at the
moment to handle both cases in the common code.
Make sure to specify the location of the library path. Use the environment variable LD_LIBRARY_PATH to include the location of the libxml distribution and also the lib directory in the expat distribution:

setenv LD_LIBRARY_PATH ${LIBXML}:${LIBEXPAT}/lib

or, in bash,

export LD_LIBRARY_PATH=${LIBXML}:${LIBEXPAT}/lib
To install a binary version of the package, unzip the archive

unzip XML_1.6-3.zip

and then load the package in R:

library(XML)

To build from source (this uses the Windows build tree under R_HOME/src/gnuwin32), unpack the distribution into R's library source directory:

cd R_HOME/src/library
tar zxf XML_1.6-3.tar.gz

The files to edit are in the XML/src/ directory. You will need to provide the names of the directories in which the libxml2 header files and the libxml2 library can be found. Then build the package:

cd ../gnuwin32
make pkg-XML
The distribution contains the following files:

File | Description
---|---
DocParse | Parser using libxml.
EventParse | Parser using expat.
RSCommon | File that allows us to use the same code for R and S4/Splus5 by hiding the differences between the two via C pre-processor macros. This file is copied from $OMEGA_HOME/Interfaces/CORBA/CORBAConfig.
Utils | Routines shared by both files above for handling white space in text.
RS_XML.h | Name-space macros for routines, used to avoid conflicts with routine names in other libraries.
RSDTD | Routines for converting DTDs to user-level objects.
GNUmakefile | Makefile controlling the compilation of the shared libraries, etc.
expat/ | Makefiles that can be copied into the expat distribution to make shared libraries for use here.
libxml/ | Makefiles that can be copied into the libxml distribution to make a shared library.
Src/ | R/S functions for parsing XML documents/buffers.
man/ | R documentation for the R/S functions.
Docs/ | Document (in noweb) describing the initial ideas.
data/ | Example functions, closure definitions, DTDs, etc. that are not quite official functions.
We provide makefiles (in the directories expat/ and libxml/ of this distribution) to perform the necessary operations. A simple way to place these in the appropriate distribution is to give the commands

make LIBEXPAT=/dir/subdir expat

and

make LIBXML=/dir/subdir libxml

These require GNU make to be installed. These makefiles circumvent the regular Makefiles in the distributions.
For expat, the makefiles are in expat/:

make LIBEXPAT=/wherever expat

Before doing this, you will have to edit these files to ensure that the correct values are used for compiling shared libraries. At present, there are settings for the gcc and Solaris compilers. Edit the file expat/GNUmakefile.lib and comment out the settings that do not apply to your machine. Specifically, if you are using the GNU compiler (gcc), comment out the two lines for the Solaris compilers (the second settings for PIC_FLAG and PIC_LD_FLAG).
For libxml, the makefiles are in libxml/. Set LIBXML to the location in which you have installed the libxml distribution:

make LIBXML=/wherever libxml

In the libxml distribution itself, run the usual

./configure
make

Then apply our patch from libxml/ to the libxml directory. This can be done via the commands

cd libxml
make LIBXML=/wherever/installed patch
The patch adds targets along the following lines to libxml's Makefile:

libxml.so : $(OBJS)
	$(CC) $(SHARED_LD_FLAGS) -o $@ $(OBJS)

Cflags:
	@echo $(CFLAGS)

The shared library can then be built with

make libxml.so CFLAGS="-fpic `make Cflags`" SHARED_LD_FLAGS=-shared

The patch itself is applied by

make libxmlpatch

This runs

(PWD=`pwd`; export PWD ; cd $(LIBXML) ; patch -f < $(PWD)/libxml/PATCH.libxml)

and works with the GNU patch.