A Statistical Engine within an XSL Translator: S and Xalan

Duncan Temple Lang

01/2001

Table of Contents

  1. The Basics of XML and XSL
  2. Connecting S and XSL
     1. Initializing R
     2. Built-in S Functions
     3. Find Functions
     4. Registering New Functions
        In XML
        In S
  3. Converting Non-Primitive Objects
  4. Examples
  5. Embedding Other Systems
     1. Omegahat/Java
     2. Access to relational database management systems.
     3. Python.
     4. Perl.
     5. When Embedding isn't Possible
  6. Other XSL Translators
  7. Issues and Questions
  8. Examples
     1. Plots
     2. Conditionals
  Footnotes


                                XML will prove to be an important tool for statisticians, both in how data are exchanged and in publishing. Research papers, data analyses and educational material can all be developed in a way that better supports the inclusion of code and output from statistical software. XSL is the tool that converts the XML to a more readable form targeted at human readers. The ability to dynamically create and integrate the output from statistical procedures into documents can benefit greatly from an interface between XSL and the statistical software. We describe the simple mechanism by which the S language is embedded within the XSL translator Xalan and how this can be used to create reproducible documents that are in many senses both reports and programs.

                                1. The Basics of XML and XSL

                                2. Connecting S and XSL

                                Xalan provides a convenient mechanism for adding functions to the XSL engine. (At this time, there is no way to add new directives as elements.) We can add any number of functions which map to R functions, and also allow calling arbitrary S functions. For example, we may provide access to mathematical functions such as exponentiation, log, etc.; random number generators; file access; string manipulation; model fitting; and so on. Additionally, we may provide access to the evaluator by providing an eval function in XSL which passes the string (or XML nodeset) to the S evaluator.

                                The idea is that people write XSL rules with which to process nodes in an XML document. These rules can call S functions as if they were built into the XSL translator. Such calls can generate output in R that is inserted into the XML being created. Additionally, one can call S functions within XSL conditional expressions in order to control the XSL processing and what appears in the target document.

                                One can call individual functions and also evaluate S expressions provided as strings. The arguments to these calls can be literal values specified in the XSL file or parameters specified on the command line. Additionally, the inputs can come from the XML file being processed and the nodes available to the XSL rule. And finally, the input values can be variables in S that were created in earlier calls. This allows us to integrate inputs from a variety of different sources and to also treat the S session as a worksheet with connected data that persists for the duration of the transformation.
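
                                The worksheet idea can be sketched in S itself. The following is only an illustration: the helper evalString and the environment named session are hypothetical, and the actual r:eval function is implemented in the layer between Xalan and S, whose internals are not shown in this document.

```r
# Hypothetical stand-in for the S side of r:eval: every string is
# evaluated in the same environment, so variables persist across calls.
session <- new.env()

evalString <- function(code, envir = session) {
  # parse() turns the text into expressions; eval() runs them in order
  # and returns the value of the last one.
  eval(parse(text = code), envir = envir)
}

evalString("x <- c(2, 4, 6)")   # an earlier call creates a variable
evalString("mean(x)")           # a later call can use it: 4
```

                                Keeping a single environment alive for the whole transformation is what lets later calls see the results of earlier ones.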

                                1. Initializing R

                                When one calls an R-XSL function (either directly or indirectly), the R session will be initialized. By default, this is done with a single argument, --silent, which avoids verbose output from R. Initializing the session in this fashion will typically work well for most applications. The usual R startup will take place, including looking for a .Rprofile, etc. (To be checked: whether .Renviron is read in a shell script or from the C code.) If one needs to do some computations before others are performed, such as initializing variables or loading libraries, one can do these in the XML or XSL file via <code> tags.

                                However, there may be occasions on which one needs to control how the R session is initialized. We support several ways to do this. The first two of these are implemented, and the third is just a design.

                                Command line
                                  One can identify arguments on the command line that are to be passed to R using the --R argument. All arguments after this (and before any other flags identifying arguments for another system) are gathered together and passed to R when it is initialized.
                                   Rxslt -in foo.xml -xsl foo.xsl --R --gui=none --no-site-file
                                r:init()
                                  We provide an XSL function that initializes the R engine, passing the arguments to this XSL function as command line arguments for R. One can call this function from within an XSL rule. For example,
                                   <xsl:template match="article">
                                     <xsl:if test="r:init('--gui=none','--no-site-file') &lt; 0">
                                       Error initializing R
                                     </xsl:if>
                                   </xsl:template>
                                  Note that we don't have an argument 0 referring to the name of the program by which the process was invoked.
                                XML
                                  Finally, one can put the arguments in an XML file and pass this to Rxslt. The file, say Rinit.xml, should look something like
                                   <s:init>
                                     <arg>--gui=none</arg>
                                     <arg>--min-vsize=4M</arg>
                                   </s:init>
                                  Then one identifies this as a file that should be read to get the commands for initializing S using the -R-init-file argument, as in
                                   Rxslt -in foo.xml -xsl foo.xsl -R-init-file Rinit.xml
                                The advantages of the different approaches relate to what is changing most frequently. If one needs to specify different initialization arguments per run, the command line works well. If one needs different initialization arguments for different XSL filters, putting them into the specific XSL files using r:init is the simplest way to do this. And if one wants to re-use the initialization steps across different XSL and XML input files, specifying them in a separate, reusable file via the -R-init-file argument is most convenient.

                                2. Built-in S Functions

                                We have added some specific functions that you can call as if they were built-in XSL functions. These are:
                                eval
                                source
                                  This calls a special version of source which executes the standard one and discards the return value.
                                sqrt
                                  Invokes the square root function. It should be given a number which can be specified either as a literal, or as the return value from invoking the number function in XSL.
                                date
                                  Calls the date function, returning a string containing the current date and time.
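
                                The special version of source mentioned in the list can be pictured as a thin wrapper. This is a sketch only; the name quietSource is hypothetical and the version actually used by the translator is not shown in this document.

```r
# Hypothetical name: run the standard source() for its side effects and
# deliberately return nothing useful to the caller.
quietSource <- function(file, ...) {
  source(file, ...)
  invisible(NULL)
}

# Usage: write a small script to a temporary file and source it.
script <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script)
result <- quietSource(script)
is.null(result)   # TRUE: the value of the last expression is discarded
```

                                Discarding the value presumably keeps the result of the last expression in the file from being inserted into the output document.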
                                It is more efficient and flexible to register these routines explicitly with the XSL translator than to let Xalan fail to find them and then assume they are S functions. The mechanism is easy: one simply adds the name of the S function to a list that is registered when the XSL transformer starts.

                                3. Find Functions

                                We can extend the mechanism for resolving external function references in the Xalan engine. We can do this by providing a method for the XSLT engine's mechanism for matching names to functions and having it query S to see if there is a function bound to the name of interest. For example, if the author of the XSL input calls foo, we will look up the table of external functions. If it is not a built-in function (either for the basic XSL translator or our extended version of it) and hence not in the table, we then use a second lookup approach by using the C-level equivalent of
                                 get("foo", mode="function")
                                The functionAvailable method in XPathEnvSupportDefault appears to be the one of interest. We want to extend this class and have an instance of our S-specific version be instantiated. The implementation of the method in the new class first calls the inherited method. If this returns false, we call the exists function in S to see if such a function exists. We also extend the method findFunction to actually find the function. We can be smart about storing the information about which functions are found by the call to exists in S and which are regular XSLT functions.
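
                                The two-stage lookup can be imitated at the S level. The sketch below is an illustration, not the Xalan code: the vector registeredXSLFunctions is a hypothetical stand-in for the translator's table of external functions, and resolveXSLFunction is a hypothetical name.

```r
# Hypothetical stand-in for the translator's table of registered functions.
registeredXSLFunctions <- c("eval", "source", "sqrt", "date")

# Resolve a name as the text describes: consult the table first, then
# ask S whether a function is bound to that name.
resolveXSLFunction <- function(name, registered = registeredXSLFunctions) {
  if (name %in% registered)
    return(list(name = name, origin = "registered"))
  if (exists(name, mode = "function"))
    return(list(name = name, origin = "S",
                fun = get(name, mode = "function")))
  NULL  # neither the table nor S knows this name
}

resolveXSLFunction("sqrt")$origin                   # "registered"
resolveXSLFunction("paste")$origin                  # "S"
is.null(resolveXSLFunction("noSuchFunctionHere"))   # TRUE
```

                                Caching the answer for each name, as suggested above, would avoid repeating the call to exists for functions that are used many times in a stylesheet.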

                                The question that remains is how to cause our new class to be used. The main routine currently creates an instance of XPathSupportDefault.

                                It appears that we can extend the class XSLTProcessorEnvSupportDefault and change its m_defaultSupport instance. Then, if we can set this field, we can override the way that we find functions. Part of the current problem is this delegation. The XSLTProcessorEnvSupportDefault has methods for findFunction and extFunction but it delegates these calls to the XPathEnvSupportDefault instance it has as a field. If we can set that field with an instance of a class that extends XPathEnvSupportDefault and performs the lookup mentioned above, then all will work.

                                It turns out that this is difficult to achieve because of the way the Apache code is structured. While we can override the XPathEnvSupportDefault, we cannot set the field m_defaultSupport in the XSLTProcessorEnvSupportDefault since it is private. By making it protected or by providing a set accessor method for it, we can set it in the constructor. However, this causes problems since the assignment operator for the XPathEnvSupportDefault class is private (and is not implemented).

                                The current solution is to extend the XSLTProcessorEnvSupportDefault class and override the extFunction method. We call the default one (which uses delegation to the m_defaultSupport) and catch the exception this throws if the function is not found. In the exception handler, we instantiate a new NamedRFunction and execute it. This does incur the overhead of an exception throw, which it would be nicer to avoid.

                                The <paste> example in test/report.xml and test/report.xsl demonstrates how to use this.

                                4. Registering New Functions

                                As with all inter-system interfaces, there is an issue about how to convert XSL arguments to S values and how to return the result of the function. In many cases, we know a great deal about the functions and what type of arguments they expect and return. We allow the user to "register" these functions. The user can specify
                                1. the name of the function
                                2. the number of arguments
                                3. the details of the arguments, consisting of any or all of
                                In this context, there are 5 possible types of values for the arguments [1] :

                                In XML

                                Currently, one specifies the details of the S-XSL functions in a separate file. These are first processed by the XSL translator by applying a specialized (and perhaps compiled) XSL stylesheet to read them. These register the functions with the XSL translator. Then the real target file is processed.

                                In the future, we may decide to extend the XSL elements to support an xsl:register element. This would allow the registration of the functions to be performed within the XSL file (or one that is included in the main XSL input file). We are hesitant to do this as it would make our XSL non-standard. However, the developers of Xalan have indicated that it will support element extensibility and so this will be more common.

                                We should note that specifying all the details of a function may be cumbersome and inconvenient. A simpler version of the registration would allow a signature to be specified without detailing the argument names, etc. For example,

                                    <signature value="N,I"/>
                                For the moment, we require the full form. Since this information is specified infrequently and one can avoid registering the function at all, this hardly seems like a serious problem. In the future we may add some shortcuts.

                                Alternatively, we can use S's reflection mechanism and its ability to create XML document fragments to create much of the registration specification. Using the S4 methods, one also has information about the types of arguments, and the specification can be completely determined by S.

                                In S

                                We can also provide facilities for registering XSL functions from within S. These would add the entry to the XSL translator's function table, making them available to the XSL engine and the stylesheet. The purpose behind this interface is to provide S programmers with a familiar and powerful mechanism to specify the details of these functions. It is more convenient to specify information about functions within that language, as this can be done programmatically rather than manually. Also, it is more convenient since S is a more powerful programming language than the XML/XSL combination.

                                The basic idea is simple. We provide a function, registerXSLFunction which allows the caller to specify the details of the XSL function to be registered. These include the name, signature, return type and argument names and default values.

                                 registerXSLFunction <- 
                                 function(name, returnType=NULL, signature="", names=NULL, defaults=NULL) {
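
                                Only the signature of registerXSLFunction is given above. As an illustration of the kind of bookkeeping such a function could perform, here is a minimal sketch under a different name, registerXSLFunctionSketch. It records entries in an S environment, .xslRegistry, rather than in the translator's own table. Both names are hypothetical, and the type codes in the signature are stored without being interpreted.

```r
# Hypothetical registry; the real table lives inside the XSL translator.
.xslRegistry <- new.env()

registerXSLFunctionSketch <- function(name, returnType = NULL, signature = "",
                                      names = NULL, defaults = NULL) {
  if (!exists(name, mode = "function"))
    stop("no S function named ", name)
  # Split a signature such as "N,I" into one type code per argument.
  types <- if (nzchar(signature))
    strsplit(signature, ",", fixed = TRUE)[[1]]
  else
    character(0)
  if (!is.null(names) && length(names) != length(types))
    stop("names and signature disagree on the number of arguments")
  entry <- list(name = name, returnType = returnType, types = types,
                names = names, defaults = defaults)
  assign(name, entry, envir = .xslRegistry)
  invisible(entry)
}

registerXSLFunctionSketch("sqrt", returnType = "N", signature = "N",
                          names = "x")
get("sqrt", envir = .xslRegistry)$types   # "N"
```

                                Checking the entry when it is registered means that a mismatch between the signature and the argument names is reported once, rather than each time the stylesheet calls the function.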

                                3. Converting Non-Primitive Objects

                                It is obvious how to convert numbers, strings and logical values between XSL and S. This gets slightly more complicated when returning vectors of these types containing more than one value from S to XSL. And it is significantly less obvious as to how to handle non-primitive objects such as named vectors, lists and objects with a class attribute. We use a relatively simple and hopefully appropriate approach at present. (It can be easily changed if anyone has any better suggestions.)

                                Suppose we execute the expression
                                 summary(x)
                                where x is a numeric vector. The result is an object of class table containing different quantiles and the mean of the collection of numbers. We want to convert this object to an XML node, or more precisely to a fragment of a document, that can be inserted into the XML document being created by the XSL translator.

                                How do we create this XML from S? Simple. We arrange to have a method (or part of the standard output or characteristic of the default input reader) that converts the result objects to XML. It does this in the obvious manner:

                                1. if there is a method for converting the specific object to an XML fragment, it uses that; otherwise,
                                2. it uses a default approach by creating an XML representation by recursing the structure of the S object.
                                When the S representation of the XML fragment is created, this is passed to the C++ code that brokers the interface between the XSL translator and S, and this converts it to a ResultTreeFragment.
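
                                The two-step rule (a specific method if one exists, otherwise a recursive default) maps naturally onto S method dispatch. The sketch below builds XML as plain text with base functions only. The generic toXMLText, the element names vector, value and date, and the Date method are all hypothetical choices made for illustration; the representation actually used is not specified in this document.

```r
# Escape the characters that are special in XML text and attributes.
xmlEscape <- function(s) {
  s <- gsub("&", "&amp;", s, fixed = TRUE)
  s <- gsub("<", "&lt;", s, fixed = TRUE)
  s <- gsub(">", "&gt;", s, fixed = TRUE)
  gsub("\"", "&quot;", s, fixed = TRUE)
}

# Generic: a class with its own method uses it; anything else falls
# through to the recursive default.
toXMLText <- function(x, name = NULL) UseMethod("toXMLText")

toXMLText.default <- function(x, name = NULL) {
  x <- unclass(x)
  attr <- if (is.null(name) || !nzchar(name)) ""
          else paste0(" name=\"", xmlEscape(name), "\"")
  if (is.list(x) || length(x) > 1) {
    # Recurse over the elements, carrying each element's name along.
    nms <- names(x)
    kids <- vapply(seq_along(x), function(i)
      toXMLText(x[[i]], if (is.null(nms)) NULL else nms[i]),
      character(1))
    paste0("<vector", attr, ">", paste(kids, collapse = ""), "</vector>")
  } else {
    paste0("<value", attr, ">", xmlEscape(as.character(x)), "</value>")
  }
}

# A class-specific method takes precedence over the default.
toXMLText.Date <- function(x, name = NULL)
  paste0("<date>", format(x, "%Y-%m-%d"), "</date>", collapse = "")

toXMLText(c(a = 1, b = 2))
toXMLText(list(when = as.Date("2001-01-15"), n = 3L))
```

                                Element names are kept fixed and the S names are stored in an attribute because names such as "1st Qu." from summary are not legal XML element names.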

                                4. Examples

                                Let's consider the following simple XML element.
                                 <s:output><code>summary(rnorm(10))</code></s:output>
                                The idea is that we want the output from the S expression summary(rnorm(10)) to be inserted in place of the code. We can arrange for this to happen with the following XSL rule:
                                 <xsl:template match="s:output">
                                   <xsl:value-of select="r:eval(string(./code))"/>
                                 </xsl:template>

                                5. Embedding Other Systems

                                1. Omegahat/Java

                                No need, since we can integrate it directly with the Java version of Xalan.

                                2. Access to relational database management systems.

                                We might want to include the output from SQL expressions in relational database servers within a document.

                                3. Python.

                                4. Perl.

                                5. When Embedding isn't Possible

                                Why are we implementing this functionality when it involves writing low-level code and requires additional thought by the authors? The answer is relatively simple and is itself a question: what's the alternative? The same effect could be achieved by passing the original XML document to S and having it parse the contents (using the XML package) and process each of the code sub-nodes within the tree by evaluating its contents and substituting the XML representation of the output. That approach does not depend on the tools described in this document, so it remains available. The difficulty with it is that it is clumsy and, more importantly, the two filters cannot communicate with each other in a backward-forward interaction. Firstly, we cannot use S functions within constructs such as logical expressions within test clauses of if, when, ... elements or the select or match attributes of value-of or template elements. Perhaps more importantly, while XSL encourages locality in its rules, the XSL templates can access other nodes in the document while processing a node. In the two-filter approach, the S filtering cannot access the other nodes without considerable effort, which would amount to providing an implementation of XSL (or some of its features) within S [2]. Also, the two-filter approach, as with so many inter-system interfaces, uses strings to communicate between the two stages. This reduces the information within the system by removing the types of the values.

                                When an application cannot be embedded in the XSL translator, we can use a remote method invocation mechanism to call objects in a separate process using CORBA, RMI, DCOM, etc. Additionally, we can embed the XSL translator into other applications if we can either a) recompile that application, or b) dynamically load code into it. Neither is obviously feasible for all applications.

                                6. Other XSL Translators


                                7. Issues and Questions

                                When we use strings in XSL to represent variable names in R, we cannot tell the difference between a literal and a variable name unless we require users to quote string literals. For example, in the XML snippet <tag attr1="x" attr2="'x'">...</tag> the first attribute is intended to be a variable name and the second is a literal string.

                                We may want to experiment with each in examples to determine which is the more useful default.

                                Alternatively, we can introduce a function, e.g. r:variable which would take a name as input and return an object that would tag it as a variable. The issue here is how to represent that within the XSL translator (as a return value).

                                And another, perhaps significantly simpler, approach is to use substitute and do.call and symbolic friends.
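
                                The quoting convention and the do.call approach can be combined in a few lines of S. This is a sketch under the assumption that a value wrapped in single quotes is a literal and anything else is a variable name; the helpers asSArgument and callFromStrings are hypothetical.

```r
# Turn an attribute string into an S argument: 'x' (quoted) becomes the
# literal string "x"; x (bare) becomes a symbol that is looked up when
# the call is evaluated.
asSArgument <- function(s) {
  if (grepl("^'.*'$", s))
    substr(s, 2, nchar(s) - 1)
  else
    as.name(s)
}

# Build the call from strings and evaluate it in the given environment.
callFromStrings <- function(fun, args, envir = parent.frame()) {
  force(envir)
  do.call(fun, lapply(args, asSArgument), envir = envir)
}

x <- c(3, 1, 2)
callFromStrings("length", "x")     # 3: x is looked up as a variable
callFromStrings("nchar", "'x'")    # 1: 'x' is the literal string "x"
```

                                Because the symbol is resolved only when the call is evaluated, the variable can be one created by an earlier call in the same S session.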

                                8. Examples

                                1. Plots

                                We can use an XML document as a report template. We can then regularly create an actual report from the template based on the current data, such as the daily values, etc. For example, we take a simple document that produces summaries of two variables obtained from a database when the document is generated. It provides summary statistics for both variables, a histogram of each variable and a scatterplot. It writes the data. Note that this can be done in a regular programming language such as S, Perl, etc. However, it may not be as convenient since we are removing the document layout aspect from that environment. We cannot, for example, easily spell-check any text in the document. We cannot edit the document easily. By having inserts into the document, we can allow the authors to use these code segments in different ways. This allows people to easily change the appearance of the report without re-programming in lower-level languages.

                                2. Conditionals

                                We can use S within the conditions of <if> and <choose> elements. We might want to insert text conditionally based on the value of the specified node. For example, suppose we are processing temporal data and want to render values that are earlier than a specified date by coloring them red, while observations after that date will be blue. We may provide this date as a parameter or have it as a value in S. Then a rule fragment such as the following selects the color:
                                 <xsl:element name="font">
                                   <xsl:attribute name="color">
                                    <xsl:if test="r:compareDates(,$cutOff) &lt; 0">red</xsl:if>
                                    <xsl:if test="r:compareDates(,$cutOff) > 0">blue</xsl:if>
                                   </xsl:attribute>
                                   <xsl:apply-templates select="date"/>
                                   <xsl:apply-templates select="value"/>
                                 </xsl:element>
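
                                The definition of the S function behind r:compareDates is not shown, and the first argument of the call is elided in the fragment above. A minimal sketch of such a function follows, assuming that both dates arrive as strings in year-month-day form; the date format and the return values -1, 0 and 1 are assumptions.

```r
# Hypothetical S-side helper for r:compareDates. Returns -1 if a is
# earlier than b, 0 if they are the same day, and 1 if a is later.
compareDates <- function(a, b, format = "%Y-%m-%d") {
  a <- as.Date(a, format = format)
  b <- as.Date(b, format = format)
  if (is.na(a) || is.na(b))
    stop("could not parse dates with format ", format)
  sign(as.numeric(a - b))
}

compareDates("2000-12-31", "2001-01-15")   # -1: earlier than the cut-off
compareDates("2001-02-01", "2001-01-15")   #  1: later than the cut-off
```

                                Returning a signed number rather than a logical value lets the same function serve both the "less than" and the "greater than" tests in the stylesheet.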


                                1. These correspond to the types that are supported by the XObject class.
                                2. This is feasible, especially since XSL is written in XML and so we can read both the XML and XSL documents. Implementing the XSL actions requires significantly more labour.