The basic idea of the online database access (odbAccess) is essentially to be able to turn an HTML form into an S function. Rather than having to manually interact with the form via a Web browser, S users can use the functionality of the form programmatically. This can be used to avoid manual intervention and can greatly simplify repetitive tasks. It makes the process reproducible since it is programmatic rather than based on human actions. And, importantly, it reduces the opportunity for error in transferring the inputs to the form, saving the ouptuts to a file and bringing them into R. In short, it reduces the number of steps involved, while maintaining, at least, the same functionality.
The basic procedure is as follows. To get programmatic access to the HTML form's functionality in R, we have the following steps.
Identify the HTML page of interest. We will assume that it contains only one HTML form. Some pages do house 2 or more forms and we will deal with this later.
We read the form using
htmlTreeParse
in the
XML package. We use the
formElementHandlers
function to customize the
parsing to collect a description of each of the HTML form elements in
the page. Having finished the parsing, we have an S object of class
HTMLFormDescription which describes the entire
form, i.e. all its HTML form elements, its URI (if available),
parameters for how to submit the form (e.g. the action).
From this description of the form, we construct an S function to mimic the HTML form.
The function has formal arguments or parameters corresponding to the HTML form elements that one could change in the Web browser. For those HTML form elements that provide a default value, e.g. a selected radio button or checkbox, text in a textarea, or a selected menu item, we provide a default value for the corresponding parameter in our new function. We put the parameters for those HTML form elements with no default setting first in our function definition so that the caller has to supply values for these but not necessarily the other parameters.
Note that we don't include the HTML elements that are "hidden" (i.e. <input type="hidden">) in our parameter list. One cannot set these in a browser, so one cannot specify them in our new function. They will be submitted along with the user's values as part of our request, however.
An example will help make these steps more concrete and clear. Let's take a simple sample HTML form at http://www.speakeasy.org/~cgires/perl_form.cgi. You might want to visit that page using your Web browser and submit the form to understand what it does and what it returns. If we were to read the HTML, we would find that it contains six form elements. There is a textarea, a collection of radio buttons, a collection of checkboxes, and a pull-down menu. The additional two elements are buttons for submitting and resetting the form. We don't have to worry about these as they don't correspond to actual inputs in a programmatic language. They are simply user interface controls.
The names of the four elements of interest are "some_text", "box", "choice", "radbut" and these are identified by the <html:attribute>NAME</html:attribute> attribute on the <html:tag>INPUT</html:tag> element in the HTML document. When we submit the form (via a Web browser or in R), we must provide values for each of these names. The values come from the form elements. Let's look at the radio buttons
<input type="radio" name="radbut" value="oop" checked> Oop
<input type="radio" name="radbut" value="eep"> Eep
<input type="radio" name="radbut" value="urp"> Urp
We see the <html:attribute>name</html:attribute> attribute
and also a different <html:attribute>value</html:attribute>
for each button. The text that we see in the browser
is not what is actually submitted. Instead, it is the
associated <html:attribute>value</html:attribute>.
So if we clicked on the Eep button, we would
be submitting
radbut=eep
Note also the <html:attribute>checked</html:attribute> on the first button. This means that it is selected by default. We can use this to provide a default value for our function.
At this point, we might think about writing the outline of the function. We would start with a name for the function. Any name will work, but let's call it
perl.form.cgi
. We
are using the name of the script in the URI, discarding the prefix and
changing the _ to a . so it can be used easily in S. We do the same
thing for each of the parameter names, i.e. change the _'s to a . (or
capitalize the words separate by the _, e.g. someText for some.text).
So the first attempt at our function would be to define it something like the following:
perl.form.cgi = function(some.text, box, choice, radbut = "oop") { }
Note that we provide a default value for radbut since
that was the checked radio button.
The next step is to find the description of the form. Well, we have read the HTML manually. So now we would have to create the S objects that describe the HTML form and its elements. Well, we can do this programmatically. First, we create the machinery that will process the HTML elements and create S objects to describe the form. Then we parse the HTML file and extract the form description.
h = formElementHandlers("http://www.speakeasy.org/~cgires/perl_form.cgi") invisible(htmlTreeParse("http://www.speakeasy.org/~cgires/perl_form.cgi", handlers = h)) form = h$values()
From this, we can find the names of the form elements:
names(f$elements)
[1] "some_text" "box" "choice" "radbut"
as before. Looking at the radbut element,
f$elements$radbut
$name [1] "radbut" $defaultValue [1] "oop" $options [1] "oop" "eep" "urp" attr(,"class") [1] "HTMLRadioElement" "HTMLFormElement"
we see the default value is available to us.
The package provides a function (
createArgList
) to create the
argument list.
cat(createArgList(form$elements, form$url))
some.text, box, choice = 'Ha', radbut = 'oop', .url ='http://www.speakeasy.org/~cgires/perl_form.cgi'
So choice and radbut have default values.
Returning to the new function we are developing, we need to have the form description around. Since we started with that (from the
htmlTreeParse
call), we can serialize that to
a file using the
save
command.
save(form, file = "perl.form.cgi.rda")
Then, in the function, we can load this object back into R.
e = new.env() load("perl.form.cgi.rda", envir = e) form = get("form", envir = e)
We can make this into a little function.
We, of course, have to worry about where the file is actually
located. However, we will probably put this function into an R package
and so we can find it in the data
directory of the package, e.g.
data("perl.form.cgi", package = "myHTMLForm")
or
system.file("data", "perl.form.cgi.rda", package = "myHTMLForm")
If we don't want to make this into a package for us or others to use in the future, we will probably elect simply to create the function directly in R. In that case, the form description object can be stored in the environment of the function.
perl.form.cgi = function(...) { } e = new.env() assign("form", form, envir = e) env(perl.form.cgi) = e
At this point, we have seen several ways to store and retrieve the description of the form. The next step is to gather the inputs to the HTTP request. We could be cute and pass control to a general function that would process the function call with respect to the form description and work from there. That unfortunately makes the mechanism somewhat obscure, especially when things go wrong and error messages are not entirely consistent. So, following the KISS[1] principle, we will avoid this (at least for the present!). Instead, we collect the arguments passed to the function corresponding to the form elements into a list. We pass this and the additional values (the URI, the form description) to a general function that can dispatch the calls.
This general function (
formQuery
)
is then charged with
So, in summary, we use the following steps.
formElementHandlers
xmlTreeParse
formQuery
form = getHTMLFormDescription("http://www.speakeasy.org/~cgires/perl_form.cgi") z = createFunction(form)
z("Duncan", "box1") z("other", "box1, box2", "eep")
curlOptions
z("Duncan", "box1", verbose = TRUE, header = TRUE)
We'll try to create a interface from R to Google. Just submitting a basic query is not that challenging since there is only one input field - the text field in which one types the search string. The advanced search page is more interesting for many reasons. Firstly, there are many form elements. Secondly, there are many forms!
First, we ask for a description of Googles search
google = getHTMLFormDescription("http://www.google.com/advanced_search?hl=en")
Note that when using
getHTMLFormDescription
,
there may be warnings generated. It is always a good idea to check
these and verify that they are safe to ignore. In most cases, the
warnings are related to malformed HTML in the document containing the
forms. Even well-formed HTML often has HTML entities that are not
properly specified or defined for the parser.
One should also be aware that the page we are downloading via
getHTMLFormDescription
may be different from the one you see in your
browser. Indeed, the page may be different if you go to another
browser (e.g. Mozilla rather than Opera or Microsoft's IE). The
reason for this is that the Web server can return pages based on the
request it receives which provides a description of the requester via
the User-Agent field. Some sites customize the returned page to work
"optimally" with that application. If it does not recognize this, then
it may give a more generic page, sometimes with some features not
present or differently implemented. If one wants the exact page you
see in a particular browser, one can specify the details of the
request in the HTTP header via the RCurl
options.
library(odbAccess) f = getHTMLFormDescription("http://www.wormbase.org/db/searches/advanced/dumper")
We get some warnings about the HTML in the page.
wormbase = createFunction(f) x = wormbase("AC3.8 M7 X IV III:1000..4000")
The result is a single string
> length(x) [1] 1 > nchar(x) [1] 2497069
This takes a long time to complete. If you do this in a browser, it also takes a long time to download the full page. It does however show the initial part as it is received which gives the impression that it is complete at that point. This is, however, 2 megabytes of text.
g("unc-30", feature="Expression chip profiles")
Unfortunately, this is a complicated function.
When we select the species, the options for the other
elements change.
form.briggsae = getHTMLFormDescription("http://www.wormbase.org/db/searches/advanced/dumper?species=briggsae") g.briggsae = createFunction(form.briggsae) g.briggsae("cb25.fpc2397", feature = "Genefinder genes")
This returns the value as a single string. This is because that is what the lower-level call to submit the form uses
basicTextGatherer
to collect
and return the chunks of the response
as a single string.
We can, however, provide our own collector for the chunks.
We might want to collect this into the separate chunks.
Alternatively, we might want to explicitly process it
as it arrives.
Let's consider the first of these, i.e. storing the response in chunks. We can provide our instance of the
basicTextGatherer
function. We can ask the
form submitter to call this for each chunk and then we can request the
value ourselves with the appropriate argument to avoid collapsing the
chunks into a single value.
h = basicTextGatherer() wormbase("AC3.8 M7 X IV III:1000..4000", writefunction = h$update) chunks = h$value(NULL)
We could also use
readLines
to get the individual lines.
x = worbase("AC3.8 M7 X IV III:1000..4000") tmp = readLines(textConnection(x))
Now let's think about processing the chunks as we encounter them. The data look something like
>cTel54X.1 (cTel54X.1:508..272)
aaccttttgggaagaagtacattccaacaatgcatcttctggatctacggtaccatctcagttgaagttgcaagagtgttcaaacccaattattctggtg
gcgatgaatatgaattcgagtattttaacatgtataacaagtttctgatcagttgtgtcaaggctgaatctattggtttcagttggaaatttattattaa
gcaatgtgttgtgacgtccgcttgtaccatttgctga
>cTel54X.1 (cTel54X.1:1765..1682)
ccaggatcacatttcccctactctttccgatatgcccacagagatccttggtcaagtattggaaaagctaaagccagtggatca
>cTel54X.1 (cTel54X.1:1918..1818)
atggtaacaatcagccgagaagcccgacgtgctctgactgaagaacgtggagctgaacttgacaaatggtgggcaattaattttgtgcgatatatgaatt
t
If we had all the lines, we might break these up into blocks by finding the indices of the lines starting with '>'
Some pages have multiple forms. We have to parse these slightly differently to obtain the form descriptions. Rather than accumulating all the form elements in a single location, we have to put them with the right form. There are several approaches to doing this. One is to bring the entire page into R as a tree and then to recursively descend that find the different forms. This is a very different procedure than the one we used for a page with a single form, so we would prefer not to use this. It is also slightly inelegant code. Another approach is to use handlers argument the
htmlTreeParse
function again. We can
separate the code that was originally in the HTML form handlers as
basic constructors and updating state. Basically, most of the
functions prepared some object and then assigned it to the list of
elements. These are clearly two separate tasks, and the first can be
reused. When processing multiple forms, we want to process the
individual form elements but not assign them to anywhere local.
Instead, we want to put them back into the XML tree. Then, when we
hit the <html:tag>FORM</html:tag> element, we want to fetch only those
elements.
There is one important aspect of forms that we haven't yet covered in our discussion about how this mechanism works. Certain forms are "dynamic" in the sense that changing a value on the form causes it to dynamically determine different possible values for the other form elements. The wormbase dumper is an example of this where one can change the species from C. elegans to C. briggsae and the possible feature sets that one can query is updated in the new view of the form. This is actually done by submitting the form with the new species selection and retrieving a new HTML document with the new form elements.
What this means for us is that the set of valid inputs are dependent on values for the other inputs. To handle this at run time, we want to have the names of the elements whose values alter the possible values of other elements. Let's start with the simple case of the wormbase dumper form above. This is remarkably simple. There is only one element that causes the form to be updated. This is the species menu which has only two possible values - elegans and briggsae. For each of these two possible values, we could store the form elements that make sense for the particular value.
For species, we want a list with a form description for briggsae and another one for elegans.
u = odbAccess:::mergeURI(URI(f$formAttributes["action"]), URI(f$url)) description = list() for(i in names(f$elements$species$options)) { o = formQuery(list(species = i), toString(u), f$elements, f$formAttributes, .checkArgs = FALSE) descriptions[["species"]][[i]] = getHTMLFormDescription(o, asText = TRUE) }
The general function for the HTML form has access to this description.
As it processes the arguments in the call, it descends down the
different paths. As it processes the species argument, it validates
this against the possible values for this element in the regular form
description. Based on the value for species, it then looks
at the remaining form elements from the form description for that value
of species.
There are two possibilities. One is that the set of elements remain the same in the form and only the values change. And a second is that the set of elements change. The latter is the more complicated situation. We can accomodate both using the model for the second case. If we have a list of dynamic form elements (DynamicHTMLFormElementsList), then we process these by branching sequentially.
If the set of parameters (rather than possible values) change dynamically based on a value for one parameter, the interface is quite complicated. For example, suppose that we have a form with an option A or B (e.g. in a menu or radio button). Selecting A means that one specifies a value for a parameter named height. Otherwise, if B is selected, one provides a value for weight. If we were to map this to a function in R using the scheme described above we would have a function of the form
function(option, height, weight) { if(option == "A") args = list(height = height) else if(option == "B") args = list(weight = weight) }
This is not exceedingly complex, but one can see that the complexity
increases rapidly if there are more form elements to deal with. For
example, suppose that if we select B, we not only specify weight, but
age and occupation. But if we select A, we provide values for height,
weight and date of birth. In this case, one might reasonably argue
that the HTML form is itself overly complicated. However, it does
provide dynamic, visual information about what is expected that can
guide the user interactively. For an S function, there are no such
dynamic cues. Instead, the user is faced with a flat collection of
possible arguments which are in fact grouped and inter-dependent in a
complex fashion. In many respects, it makes more sense to have
different functions for different possible options, e.g. one for A and
one for B where the arguments are all logically related. For this
reason, we do not provide support for converting these types of
dynamic, complex forms into S functions. The infrastructure is
present in the package so that this could be done relatively easily,
but they are not .
Now that we have discussed the basic idea of dynamic forms, let's turn our attention to how things are done in the software. The checkDynamic for the form element handler functions
formElementHandlers
and
multiFormElementHandlers
controls whether we
further resolve the form description to check whether it is a dynamic
form. If it is a dynamic form, we add an object that provides
information about the different options and the resulting possible
values for the different inputs. This is used when validating the
inputs to the S function. If one does not want the dynamic
information (which can be slow to fetch as it requires a request to
the HTTP server to submit the form for each option of the dynamic
element), one can avoid the computations by passing creating the
handler instance oneself with checkDynamic passed as
FALSE
.
Currently, there are two fundamental limitations that our approach cannot deal with. The first is that we do not (and will not) attempt to handle forms that rely on JavaScript code. It is conceivable that we could do something to handle this (e.g. embed a JavaScript interpreter), but we most likely will not. We do however recognize the simplest of JavaScript commands - submit - within certain types of elements of a form to identify whether the form is dynamic.
The second limitation is the dynamic forms. If the contents of the form elements change in response to selecting a particular value of one element, there can be issues.
Merge the URI with the action URI when we process the form, not at run-time. If the user specifies the URI at run-time, leave it alone and demand the entire thing.