R: parse JSON/XML exported compound properties from Pubchem

I would like to parse, in R, all chemical properties of a given compound as listed in PubChem, using the JSON (or XML) export facility.

Example: ALPHA-IONONE, pubchem compound ID 5282108

https://pubchem.ncbi.nlm.nih.gov/compound/5282108

library("rjson")
data <- rjson::fromJSON(file="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")

library("RJSONIO")
data <- RJSONIO::fromJSON("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")

Either call gets me a tree of nested lists, but how do I go from this rather complicated structure to a nice data frame or a list of data frames?

In this case, what I am after is everything under

3.1 Computed Descriptors

3.2 Other Identifiers

3.3 Synonyms

4.1 Computed Properties

in a single row of a data frame, with each element in a separate named column. Elements with multiple items (e.g. multiple synonyms) should be pasted together with "|" as a delimiter. In this case the result would look something like

pubchemid      IUPAC_Name    InChI       InChI_Key     Canonical SMILES      Isomeric SMILES     CAS     EC Number     Wikipedia      MeSH Synonyms     Depositor-Supplied Synonyms   Molecular_Weight    Molecular_Formula    XLogP3   Hydrogen_Bond_Donor_Count ... 
5282108        (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one       InChI=1S/C13H20O/c1-10-6-5-9-13(3,4)12(10)8-7-11(2)14/h6-8,12H,5,9H2,1-4H3/b8-7+ ....

Fields with multiple items, such as Depositor-Supplied Synonyms could be pasted together with a "|", e.g. value could be ALPHA-IONONE|Iraldeine|...

Second, I would also like to import section 4.2.2 Kovats Retention Index as a dataframe

pubchemid      column_class            kovats_ri
5282108        Standard non-polar      1413
5282108        Standard non-polar      1417
...
5282108        Semi-standard non-polar 1427
...

(section 4.3.1 GC-MS would have been nice too, but since it only displays the 3 top peaks this is a little useless right now, so I'll skip that)

Does anybody have an idea how to achieve this in an elegant way?

PS Note that not all these fields will necessarily exist for any given query.

2D structure and some properties can also be obtained from

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=2d&response_type=display

and 3D structure from

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=3d&response_type=display

Data can also be exported as XML, using

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display

if that would be any easier

Note: I also tried the R package rpubchem, but it only seems to import a small part of the available information:

library("rpubchem")
get.cid(5282108)
CID  IUPACName CanonicalSmile MolecularFormula MolecularWeight TotalFormalCharge XLogP HydrogenBondDonorCount HydrogenBondAcceptorCount HeavyAtomCount    TPSA
2 5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one        C13H20O       192.297300               0                 3     0                      1                        14             17 5282108

Solution

My proposal works on the XML export, because XPath makes it more convenient to traverse the tree and select nodes.

Please note that this is neither fast (it took a few seconds while testing) nor optimal (each file is parsed twice: once for names and the like, and once for the Kovats Retention Index). But I guess you will want to parse a set of files once and then get on with your real business, and premature optimization is the root of all evil.

I have put the main tasks into separate functions. If you want data for one specific PubChem record, they are ready to use. If you want data from several records at once, define a vector of pointers to the data and use the examples at the bottom to merge the results. In my case the vector contains paths to files on my local disk. URLs are supported as well, although I would discourage them: each URL is requested twice, and with a larger number of records you probably want to handle network failures somehow.
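
As a rough illustration of handling failures, the sketch below wraps a parser in tryCatch() so that one unreadable record does not abort the whole batch. The names safe.apply and fake.parser are hypothetical and exist only for this example; fake.parser stands in for one of the parsing functions so that the sketch runs without files or a network connection.

```r
# Hypothetical helper: run `parser` on every source and drop the ones that fail.
safe.apply <- function(sources, parser) {
  results <- lapply(sources, function(src) {
    tryCatch(parser(src), error = function(e) {
      # Report the failure and continue with the next source.
      warning("skipping ", src, ": ", conditionMessage(e), call. = FALSE)
      NULL
    })
  })
  names(results) <- sources
  # Keep only the sources that were parsed successfully.
  Filter(Negate(is.null), results)
}

# Stand-in parser (an assumption for this sketch): fails for one made-up path.
fake.parser <- function(src) {
  if (grepl("missing", src)) stop("cannot read file")
  data.frame(pubchemid = sub(".*/([0-9]+)/?.*", "\\1", src))
}

parsed <- suppressWarnings(
  safe.apply(c("./5282108.xml", "./missing.xml"), fake.parser)
)
```

Here parsed keeps only the record that could be read. The resulting list can then be merged in the same way as in the examples at the bottom.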

The compound you linked to has multiple "EC Number" entries. They differ by ReferenceNumber but not by Name. I wasn't sure why that is or what to do about it (your sample output contains only one EC Number entry), so I left it to R. R adds suffixes to the duplicated names and creates EC.Number.1, EC.Number.2 and so on. These suffixes do not correspond to the ReferenceNumber in the file, so the same column in the master data frame will probably refer to different ReferenceNumbers for different compounds.
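
If the suffixed columns are not wanted, one possible alternative is to merge entries that share a name into a single "|"-delimited value before building the data frame. This is only a sketch with made-up placeholder values; it is not what the functions below do, and it assumes that merging duplicates is acceptable for your data.

```r
# Placeholder values; only the duplicated field name matters here.
properties <- c("EC Number" = "first", "EC Number" = "second", "CAS" = "third")

# The renaming that data.frame() applies by default when names are duplicated.
suffixed <- make.names(names(properties), unique = TRUE)

# Alternative: group the values by name (keeping first-seen order) and
# paste each group together with "|".
groups <- factor(names(properties), levels = unique(names(properties)))
collapsed <- vapply(split(unname(properties), groups),
                    paste, character(1), collapse = "|")

# check.names = FALSE keeps the original field names as column names.
row <- data.frame(as.list(collapsed), check.names = FALSE)
```

With this approach every compound ends up with a single "EC Number" column, regardless of how many entries the record contains.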

It seems that PubChem names its value tags following the pattern <type>Value[List]. In a few places I have hardcoded StringValue, but some compounds may have other types in the same fields. I have mostly ignored lists, except where you asked for them. Further modifications might therefore be needed as more data is thrown at this code.
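
A type-agnostic variant could select every child whose tag name contains "Value" instead of hardcoding one type. The sketch below runs on a small made-up fragment without namespaces that only imitates the naming pattern; it is not real PubChem output, and real records need the d1: namespace prefix as in the functions below.

```r
library("xml2")

# Made-up fragment imitating the <type>Value[List] naming pattern.
doc <- read_xml('<Section>
  <Information><Name>Molecular Weight</Name><NumValue>192.3</NumValue></Information>
  <Information><Name>Synonyms</Name>
    <StringValueList>a</StringValueList>
    <StringValueList>b</StringValueList>
  </Information>
</Section>')

information <- xml_find_all(doc, "//Information")

# For each entry, paste all value-like children together with "|".
values <- vapply(information, function(x) {
  nodes <- xml_find_all(x, "./*[contains(name(), 'Value')]")
  paste(xml_text(nodes, trim = TRUE), collapse = "|")
}, character(1))
names(values) <- xml_text(xml_find_first(information, "./Name"))
```

Single values and lists go through the same code path here, at the price of treating every value as text.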

If you have any questions, please post them in the comments; I am not sure how much of the code needs explaining.

library("xml2")
library("data.table")

# Collect the selected sections of one PubChem XML record into a one-row data frame.
compound.attributes <- function(file=NULL) {
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  # All <Information> nodes in the subsections below the requested headings.
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Computed Descriptors'",
    " or text()='Other Identifiers'",
    " or text()='Synonyms'",
    " or text()='Computed Properties']",
    "/following-sibling::d1:Section/d1:Information"
  ), ns)

  properties <- sapply(information, function(x) {
    # xml_find_first() replaces the deprecated xml_find_one().
    name <- xml_text(xml_find_first(x, "./d1:Name", ns))
    items <- xml_find_all(x, "./d1:StringValueList", ns)
    value <- if (length(items) > 0) {
      # Multiple items (e.g. synonyms) are pasted together with "|".
      paste(xml_text(items, trim=TRUE), collapse="|")
    } else {
      # Otherwise take the first child whose tag name contains "Value".
      xml_text(
        xml_find_first(x, "./*[contains(name(),'Value')]", ns),
        trim=TRUE)
    }
    names(value) <- name
    value
  })
  properties <- as.list(properties)
  # The record ID is taken from the file path or URL.
  properties$pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  # data.frame() makes the column names syntactic and unique.
  data.frame(properties)
}

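# Collect the Kovats retention indices of one record, one row per value.
compound.retention.index <- function(file=NULL) {
  pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Kovats Retention Index']",
    "/following-sibling::d1:Information"
  ), ns)
  indexes <- lapply(information, function(x) {
    # <Name> holds the column class, e.g. "Standard non-polar".
    name <- xml_text(xml_find_first(x, "./d1:Name", ns))
    values <- as.numeric(xml_text(
      xml_find_all(x, "./*[contains(name(), 'NumValue')]", ns)))
    # Skip entries without numeric values; data.frame() would fail on them.
    if (length(values) == 0) return(NULL)

    data.frame(pubchemid=pubchemid,
               column_class=name,
               kovats_ri=values)
  })

  # Returns NULL when the record has no Kovats Retention Index entries.
  do.call("rbind", indexes)
}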

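# Paths to locally saved XML exports (URLs work as well).
compounds <- c("./5282108.xml", "./5282148.xml", "./91754124.xml")

# One row per compound; fill=TRUE pads fields missing from a record with NA.
cd <- rbindlist(
  lapply(compounds, compound.attributes),
  fill=TRUE
)

# One row per retention index value, stacked across all compounds.
rti <- do.call("rbind",
               lapply(compounds, compound.retention.index))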