Storing XML node values with R's xmlEventParse for filtered output
I have a huge XML file (260 MB) with tons of information that looks like this.
Example:
<mydocument>
  <positions eventtime="2012-09-29t20:31:21" internalmatchid="0000t0">
    <frameset gamesection="1sthalf" match="0000t0" club="referee" object="00011d">
      <frame n="0" t="2012-09-29t18:31:21" x="-0.1158" y="0.2347" s="1.27" />
      <frame n="1" t="2012-09-29t18:31:21" x="-0.1146" y="0.2351" s="1.3" />
      <frame n="2" t="2012-09-29t18:31:21" x="-0.1134" y="0.2356" s="1.33" />
    </frameset>
    <frameset gamesection="2ndhalf" match="0000t0" club="referee" object="00011d">
      <frame n="0" t="2012-09-29t18:31:21" x="-0.1158" y="0.2347" s="1.27" />
      <frame n="1" t="2012-09-29t18:31:21.196" x="-0.1146" y="0.2351" s="1.3" />
      <frame n="2" t="2012-09-29t18:31:21.243" x="-0.1134" y="0.2356" s="1.33" />
    </frameset>
  </positions>
</mydocument>
There are around 40 different frameset nodes, each with a different gamesection="..." and object="...".
I would love to extract the information of all <frame> nodes into a list object, but I cannot load the whole XML file because it is too large. Is there any way I can use the xmlEventParse function to filter for a specific gamesection and a specific object, and get all the information from the corresponding <frame> elements?
It might be that the 'internal' representation is not that large

xml <- xmlTreeParse("file.xml", useInternalNodes=TRUE)

and then XPath will be your best bet. If that doesn't work, you'll need to get your head around closures. I'm going to aim for the branches argument of xmlEventParse, which allows hybrid event parsing to iterate through the file, coupled with DOM parsing on each node. Here's a function that returns a list of functions.
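library(XML)   # provides xmlEventParse, xmlAttrs, xpathSApply

branchFactory <- function() {
    env <- new.env(parent=emptyenv())   # safety
    # called for every <frameset> node; elt is the fully parsed node
    frameset <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//frame", xmlAttrs)
    }
    # accessor for the environment holding the results
    get <- function() env
    list(get=get, frameset=frameset)
}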
Inside the function, we're going to create a place to store our results as we iterate through the file. This could be a list, but it'll be better to use an environment. This will allow us to insert new results without copying all the results we've already inserted. So here's our environment:
env <- new.env(parent=emptyenv())
We use the parent argument as a measure of safety, even if it's not relevant in our present case. Now we define a function that will be invoked whenever a "frameset" node is encountered:
frameset <- function(elt) {
    id <- paste(xmlAttrs(elt), collapse=":")
    env[[id]] <- xpathSApply(elt, "//frame", xmlAttrs)
}
It turns out that, when we use the branches argument, xmlEventParse will have arranged to parse the entire node into an object that we can manipulate via the DOM, e.g., using xmlAttrs and xpathSApply. The first line of this function creates a unique identifier for this frame set (? maybe that's not the case for the full data set? You'll need a unique identifier). The second line parses the "//frame" part of the element and stores it in our environment. Storing the result is trickier than it looks: we're assigning to a variable called env. env doesn't exist in the body of the frameset function, so R uses its lexical scoping rules to search for a variable named env in the environment in which the frameset function was defined. And, lo, it finds the env that we have already created. This is what we add the result of xpathSApply to. That's it for our frameset node parser.
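The two points above (an environment is updated in place, and a function finds env through lexical scoping) can be tried without any XML at all. The following is only a minimal sketch in base R; the names store, record, results and record_list are hypothetical and not part of the answer's code.

```r
# `store` plays the role of `env`; all names here are illustrative only.
store <- new.env(parent = emptyenv())

record <- function(key, value) {
    # `store` is not defined in this function body, so R looks it up in
    # the environment where `record` was defined (lexical scoping).
    store[[key]] <- value
    invisible(NULL)
}

record("a", 1:3)
record("b", c("x", "y"))
ls(store)          # both keys are now present in the environment

# The same pattern with a list only changes a local copy.
results <- list()
record_list <- function(key, value) {
    results[[key]] <- value   # modifies a copy local to this call
    invisible(NULL)
}
record_list("a", 1:3)
length(results)    # the outer list is still empty
```

The contrast with the list version is the reason an environment is a convenient accumulator inside a handler function.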
We'd also like a convenience function that we can use to retrieve env, like this:
get <- function() env
Again, this is going to use lexical scoping to find the env variable created at the top of branchFactory. We end branchFactory by returning a list of the functions that we've defined:
list(get=get, frameset=frameset)
This too is surprisingly tricky: we're returning a list of functions. The functions are defined in the environment created when we invoke branchFactory and, for lexical scope to work, that environment has to persist. So, in brief, we're returning not just the list of functions but also, implicitly, the variable env.
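To see that the environment really does persist, and that each call to a factory gets its own copy, here is a small sketch with a hypothetical counterFactory that has the same shape as branchFactory. It is only an illustration in base R, not part of the parsing code.

```r
# Hypothetical factory with the same structure as branchFactory:
# a private environment plus functions that close over it.
counterFactory <- function() {
    env <- new.env(parent = emptyenv())
    env$n <- 0
    bump <- function() {
        env$n <- env$n + 1    # updates the shared environment in place
        invisible(env$n)
    }
    get <- function() env
    list(get = get, bump = bump)
}

a <- counterFactory()
b <- counterFactory()

a$bump()
a$bump()
b$bump()

a$get()$n   # counts only the calls made through `a`
b$get()$n   # counts only the calls made through `b`
```

Each instance keeps its own env alive for as long as the returned functions exist, which is what lets a handler accumulate results across many nodes.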
We're now ready to parse our file. Do this by creating an instance of the branch parser, with its own unique versions of the get and frameset functions and of the env variable created to store results. Then parse the file:
b <- branchFactory()
xx <- xmlEventParse("file.xml", handlers=list(), branches=b)
We can retrieve the results using b$get(), and can cast this to a list if that's convenient.
> as.list(b$get())
$`1sthalf:0000t0:referee:00011d`
  [,1]                  [,2]                  [,3]
n "0"                   "1"                   "2"
t "2012-09-29t18:31:21" "2012-09-29t18:31:21" "2012-09-29t18:31:21"
x "-0.1158"             "-0.1146"             "-0.1134"
y "0.2347"              "0.2351"              "0.2356"
s "1.27"                "1.3"                 "1.33"

$`2ndhalf:0000t0:referee:00011d`
  [,1]                  [,2]                      [,3]
n "0"                   "1"                       "2"
t "2012-09-29t18:31:21" "2012-09-29t18:31:21.196" "2012-09-29t18:31:21.243"
x "-0.1158"             "-0.1146"                 "-0.1134"
y "0.2347"              "0.2351"                  "0.2356"
s "1.27"                "1.3"                     "1.33"
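The question asked for only one gamesection and one object, whereas the handler stores every frameset. One way to add that filter might be to check the attributes inside the handler before storing anything. The sketch below is an untested variation: filteredBranchFactory and its two arguments are hypothetical names, the attribute names are the ones from the sample document, and a small temporary file stands in for the real 260 MB file.

```r
library(XML)

# Hypothetical variant of the branch factory: only framesets whose
# gamesection and object attributes match are stored.
filteredBranchFactory <- function(gamesection, object) {
    env <- new.env(parent = emptyenv())
    frameset <- function(elt) {
        # xmlGetAttr returns NULL when the attribute is absent,
        # so isTRUE() guards against a zero-length comparison.
        keep <- isTRUE(xmlGetAttr(elt, "gamesection") == gamesection) &&
                isTRUE(xmlGetAttr(elt, "object") == object)
        if (keep) {
            id <- paste(xmlAttrs(elt), collapse = ":")
            env[[id]] <- xpathSApply(elt, "//frame", xmlAttrs)
        }
    }
    get <- function() env
    list(get = get, frameset = frameset)
}

# Small stand-in file with the structure shown in the question.
path <- tempfile(fileext = ".xml")
writeLines(c(
    '<mydocument><positions>',
    '<frameset gamesection="1sthalf" match="0000t0" club="referee" object="00011d">',
    '<frame n="0" x="-0.1158" y="0.2347" s="1.27" />',
    '<frame n="1" x="-0.1146" y="0.2351" s="1.3" />',
    '</frameset>',
    '<frameset gamesection="2ndhalf" match="0000t0" club="referee" object="00011d">',
    '<frame n="0" x="-0.1158" y="0.2347" s="1.27" />',
    '</frameset>',
    '</positions></mydocument>'
), path)

b <- filteredBranchFactory("1sthalf", "00011d")
invisible(xmlEventParse(path, handlers = list(), branches = b))
res <- as.list(b$get())
names(res)
```

Framesets that do not match are still parsed as branches before being discarded, so this limits what is kept in memory rather than what is read.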