R parsing XML tree with hierarchical data to dataframe


I am trying to parse some XML documents in R into a data frame. I want to flatten the XML tree so that I get one row in the data frame per child (each frame element), and I want each row to also contain the data from its parent (the event element).

example:

<xml>
    <eventlist>
        <event>
            <ProcessIndex>1063</ProcessIndex>
            <Time_of_Day>2:54:20.2959537 PM</Time_of_Day>
            <Process_Name>chrome.exe</Process_Name>
            <PID>12164</PID>
            <Operation>ReadFile</Operation>
            <Result>SUCCESS</Result>
            <Detail>Offset: 1,684,224, Length: 256</Detail>
            <stack>
                <frame>
                    <depth>0</depth>
                    <address>0xfffff8038683667c</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x1a6c</location>
                </frame>
                <frame>
                    <depth>1</depth>
                    <address>0xfffff80386834e13</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x203</location>
                </frame>
                <frame>
                    <depth>3</depth>
                    <address>0x7ffea54ffac1</address>
                    <path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
                    <location>RtlUserThreadStart + 0x21</location>
                </frame>
            </stack>
        </event>
        <event>
            <ProcessIndex>1063</ProcessIndex>
            <Time_of_Day>2:54:20.2960270 PM</Time_of_Day>
            <Process_Name>chrome.exe</Process_Name>
            <PID>12164</PID>
            <Operation>WriteFile</Operation>
            <Result>SUCCESS</Result>
            <Detail>Offset: 103,016, Length: 36</Detail>
            <stack>
                <frame>
                    <depth>0</depth>
                    <address>0xfffff8038683667c</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x1a6c</location>
                </frame>
                <frame>
                    <depth>1</depth>
                    <address>0xfffff80386834e13</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x203</location>
                </frame>
                <frame>
                    <depth>26</depth>
                    <address>0x7ffea54ffac1</address>
                    <path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
                    <location>RtlUserThreadStart + 0x21</location>
                </frame>
            </stack>
        </event>
    </eventlist>
</xml>

And the result that I would like to get is:

ProcessIndex  Time_of_Day  Process_Name  PID    Operation  Result   depth  address    path                         location
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  0      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x1a6c
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  1      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x203
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  2      0xfffff..  C:\WINDOWS\System32\driv...  RtlUserThreadStart + 0x21
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  0      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x1a6c
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  1      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x203
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  2      0xfffff..  C:\WINDOWS\System32\driv...  RtlUserThreadStart + 0x21

I tried using the XML package and xmlToDataFrame:

xmldf_events_stack <- xmlToDataFrame(nodes=getNodeSet(data_xml_2,"//eventlist/event/stack/frame"))

but that only gives me the flattened frames, without the parent data. Also, if I parse the event data into a data frame instead, all XML tags are stripped from the frame data, so there is no way for me to parse it later.

Any help or guidance in the right direction will be appreciated.


Solution

  • I solved the problem. I am sure there is a more elegant way to do this, but this is what I did. I hope it helps somebody in the future.

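    library(XML)   # xpathSApply, xmlName, xmlValue
    library(plyr)  # rbind.fill

    # data_xml_2 is assumed to be a document already parsed with xmlParse()
    df <- do.call(rbind.fill, lapply(data_xml_2['//eventlist/event'], function(x) {
      # Name of the element that owns each text node, in document order
      names <- xpathSApply(x, './/.', xmlName)
      names <- names[which(names == "text") - 1]
      values <- xpathSApply(x, ".//text()", xmlValue)
      # The first 7 values are the event fields; the rest are frame fields, 4 per frame
      framevalues <- matrix(values[8:length(values)], ncol = 4, byrow = TRUE)

      # Prepend the 7 event values to every frame row
      retvalues <- framevalues
      for (i in 7:1) {
        retvalues <- cbind(values[i], retvalues)
      }
      colnames(retvalues) <- names[1:11]  # 7 event columns + 4 frame columns
      return(as.data.frame(retvalues))
    }))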
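
The approach above depends on every event having exactly 7 fields and every frame exactly 4. A less position-dependent variant might look like the sketch below. It is not from the original post: `flatten_events` is a hypothetical helper name, and it assumes the document was parsed with `xmlParse()` and that every event has the same set of fields (otherwise `rbind.fill` from plyr could replace `rbind`). For each frame it walks up to the enclosing event with the XPath `ancestor::` axis and combines the event's fields with the frame's own fields.

```r
library(XML)

# Hypothetical helper (not from the original post): one row per <frame>,
# each row prefixed with the simple fields of its parent <event>.
flatten_events <- function(doc) {
  frames <- getNodeSet(doc, "//eventlist/event/stack/frame")
  rows <- lapply(frames, function(frame) {
    # Step up from the frame to its enclosing <event>
    event <- getNodeSet(frame, "ancestor::event")[[1]]
    # All direct children of <event> except <stack>, as a named character vector
    event_vals <- setNames(xpathSApply(event, "./*[not(self::stack)]", xmlValue),
                           xpathSApply(event, "./*[not(self::stack)]", xmlName))
    # All direct children of the <frame>, as a named character vector
    frame_vals <- setNames(xpathSApply(frame, "./*", xmlValue),
                           xpathSApply(frame, "./*", xmlName))
    as.data.frame(as.list(c(event_vals, frame_vals)), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Small made-up document, only to exercise the helper
doc <- xmlParse(paste0(
  "<xml><eventlist><event>",
  "<PID>12164</PID><Operation>ReadFile</Operation>",
  "<stack><frame><depth>0</depth></frame><frame><depth>1</depth></frame></stack>",
  "</event></eventlist></xml>"), asText = TRUE)
df <- flatten_events(doc)
```

All columns come back as character, so numeric fields such as PID or depth would still need converting afterwards, for example with `as.integer()`.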