R XML Package

I’ve spent a number of years programming in Java so, during my MSc in
Bioinformatics, it took me a while to become acquainted with the nuances and
idioms of writing code in R. These have been discussed extensively elsewhere,
and few treatments are better than John Cook’s lecture R: The Good, The Bad and
The Ugly. While at first I was frustrated with the language, I am becoming fond
of it, not least because of the increasingly rich tooling (such as RStudio) and
the packaging system. Although it is unrelated to the field of Bioinformatics,
I have started to write some sample R code for pleasure, and because of how
brief the code can be. I have been working towards creating a Shiny web app
that visualises exercise data stored in an XML format and validated against an
XML schema. You can see the code at http://github.com/hiraethus/workout.tracker.
For this I have been using the XML package available from CRAN (kindly authored
and maintained by Duncan Temple Lang), which contains a really useful function:

XML::xmlToDataFrame(doc, colClasses = NULL, homogeneous = NA,
                    collectNames = TRUE, nodes = list(),
                    stringsAsFactors = default.stringsAsFactors())

which takes an XML document with a fairly flat structure of repeated elements and creates a data frame from them. As an example, the following:

<?xml version="1.0" encoding="UTF-8" ?>
        <baz>Not First</baz>

would be rendered as a data.frame of the form

Foo  Bar  Baz
12   2.1  First
16   1.1  Not First
20   3.3  Last

Each of these columns will be interpreted as strings of characters. The
colClasses argument of the xmlToDataFrame function allows the classes to be
specified as a vector, for instance c("integer", "numeric", "character").
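
To make the conversion concrete, here is a minimal sketch that parses a small in-memory document and turns it into a data frame. The sample document is an assumption: in particular the root element name foobars is made up for illustration, and only the foobar, foo, bar and baz names come from the text above. It passes stringsAsFactors = FALSE explicitly so that the columns come back as character vectors, and then converts two of them by hand.

```r
library(XML)

# Hypothetical sample document; the root name "foobars" is an assumption.
xml_text <- '<?xml version="1.0" encoding="UTF-8" ?>
<foobars>
  <foobar><foo>12</foo><bar>2.1</bar><baz>First</baz></foobar>
  <foobar><foo>16</foo><bar>1.1</bar><baz>Not First</baz></foobar>
  <foobar><foo>20</foo><bar>3.3</bar><baz>Last</baz></foobar>
</foobars>'

# Parse the string rather than a file on disk.
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row, each grandchild a column.
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Inspect the column types before any conversion.
str(df)

# Convert the numeric-looking columns by hand.
df$foo <- as.integer(df$foo)
df$bar <- as.numeric(df$bar)
```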

This is great! Unfortunately, each of the foo, bar and baz elements must be
present in at least one of the foobar elements. Suppose this XML document could
optionally have a foobaz element of Boolean type and we specified our
colClasses vector as c("integer", "numeric", "character", "logical"). If foobaz
were not present in our document, xmlToDataFrame would fail.

The only solution I have come up with is to call xmlToDataFrame without the
colClasses argument and then replace each column with a copy converted to the
intended type, checking first that each optional column was actually read in
from the XML document. I currently do this as follows:

# Select every workout element and read them all in as character columns
workout.nodes <- XML::getNodeSet(doc=doc, "//workout-tracker/workouts/workout")
df <- XML::xmlToDataFrame(workout.nodes, stringsAsFactors = FALSE)

# Mandatory columns: convert unconditionally
df$date <- as.Date(df$date)
df$level <- as.integer(df$level)
df$timeInMinutes <- as.integer(df$timeInMinutes)
df$caloriesBurned <- as.integer(df$caloriesBurned)
df$distanceCoveredKm <- as.numeric(df$distanceCoveredKm)
df$recoveryScore <- as.factor(df$recoveryScore)

# Optional columns: convert only if they appeared in the document
if ("waistSizeCm" %in% colnames(df)) {
    df$waistSizeCm <- as.numeric(df$waistSizeCm)
}
if ("weightKg" %in% colnames(df)) {
    df$weightKg <- as.numeric(df$weightKg)
}
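
The column-by-column conversion above could also be written as a small helper that takes a named list of converter functions and applies each one only when its column exists. This is just a sketch of one way to generalise the idea, not something the XML package provides: the name coerce_columns and the sample data are hypothetical, and it uses base R only.

```r
# Hypothetical helper: apply each converter to the column of the same name,
# silently skipping converters whose column is absent from the data frame.
coerce_columns <- function(df, converters) {
  for (name in intersect(names(converters), colnames(df))) {
    df[[name]] <- converters[[name]](df[[name]])
  }
  df
}

# Made-up character data standing in for the output of xmlToDataFrame.
raw <- data.frame(level = c("1", "2"),
                  distanceCoveredKm = c("5.2", "10"),
                  stringsAsFactors = FALSE)

# weightKg is listed but missing from raw, so it is simply ignored.
converters <- list(level = as.integer,
                   distanceCoveredKm = as.numeric,
                   weightKg = as.numeric)

typed <- coerce_columns(raw, converters)
str(typed)
```

The benefit over the repeated if blocks is that adding another optional column only needs one more entry in the list.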

I am more than happy with the time savings the XML package has provided me in
converting my XML document into a data frame in R. My solution to providing
types to the columns of my data frame, while probably very inefficient, is
ample for the few hundred entries I will have (or fewer, depending on how well
I keep to my fitness regime).

In the future I will reimplement this application in the Gosu programming
language to show how we can use its type loader system to use an XSD to
statically generate objects directly from the XML.