ruby regex r web-scraping pubchem

CAS registry number to PubChem CID conversion in R


An annoying problem many chemists face is converting CAS registry numbers of chemical compounds (stored in some commercial database that is not readily accessible) to PubChem identifiers (openly available). PubChem does support conversion between the two, but only through its manual web interface, not through its official PUG REST programmatic interface.

A solution in Ruby is given here, based on the e-utilities interface: http://depth-first.com/articles/2007/09/13/hacking-pubchem-convert-cas-numbers-into-pubchem-cids-with-ruby/

Does anybody know how this would translate into R?

EDIT: based on the answer below, the most elegant solution is:

library(XML)
library(RCurl)

CAStocids=function(query) {
  xmlresponse = xmlParse( getURL(paste("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=",query,sep="") ) )
  cids = sapply(xpathSApply(xmlresponse, "//Id"), function(n){xmlValue(n)})
  return(cids)
}

> CAStocids("64318-79-2")
[1] "6434870" "5282237"

cheers, Tom


Solution

  • This is how the Ruby code does it, translated to R using RCurl and XML:

    > xmlresponse = xmlParse( getURL("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=64318-79-2") )
    

    and here's how to extract the Id nodes:

    > sapply(xpathSApply(xmlresponse, "//Id"), function(n){xmlValue(n)})
     [1] "6434870" "5282237"
    

    Wrap all that in a function:

     convertU = function(query){
        xmlresponse = xmlParse(getURL(
           paste0("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pccompound&retmax=100&term=",query))) 
        sapply(xpathSApply(xmlresponse, "//Id"), function(n){xmlValue(n)})
     }
    
    > convertU("64318-79-2")
    [1] "6434870" "5282237"
    > convertU("64318-79-1")
    list()
    > convertU("64318-78-2")
    list()
    > convertU("64313-78-2")
    [1] "313"
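
The three altered inputs above all fail the standard CAS check-digit rule (each digit before the check digit is weighted by its position counted from the right, and the weighted sum modulo 10 must equal the check digit), so a typo could be caught before any request is sent. A minimal sketch in base R follows; `is_valid_cas` is a hypothetical helper name, not part of the original post or of any package:

```r
# Hypothetical helper (not from the original post): TRUE if `cas` is a
# well-formed CAS registry number whose check digit is consistent.
is_valid_cas <- function(cas) {
  # Expected shape: 2-7 digits, 2 digits, 1 check digit, hyphen-separated
  if (!grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", cas)) return(FALSE)
  digits <- as.integer(strsplit(gsub("-", "", cas), "")[[1]])
  check  <- digits[length(digits)]
  body   <- rev(digits[-length(digits)])  # rightmost body digit first
  # Weights 1, 2, 3, ... are applied from right to left
  sum(body * seq_along(body)) %% 10 == check
}

is_valid_cas("64318-79-2")  # the CAS number from the question
is_valid_cas("64318-79-1")  # same number with a wrong check digit
```

Passing this check only means the number is internally consistent; it does not guarantee that PubChem has a matching record.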
    

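    This probably needs a check for the case where nothing is found: as the output above shows, an unmatched query returns an empty list() rather than a character vector.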
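
One possible way to handle the not-found case is to separate parsing from fetching and always return a character vector, warning when it is empty. This is only a sketch: `extract_cids` and `cas_to_cids` are hypothetical names, and the two XML strings are simplified, made-up stand-ins for a real esearch response, used so that the parsing step can be tried without network access.

```r
library(XML)

# Hypothetical helper: extract the <Id> values from an esearch response
# passed as a string. Always returns a character vector, possibly empty.
extract_cids <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  ids <- xpathSApply(doc, "//Id", xmlValue)  # list() when nothing matches
  as.character(unlist(ids))
}

# Hypothetical wrapper: fetch the response (requires RCurl and network
# access) and warn when no CID is found.
cas_to_cids <- function(query) {
  url <- paste0("http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                "?db=pccompound&retmax=100&term=", query)
  cids <- extract_cids(RCurl::getURL(url))
  if (length(cids) == 0) warning("no PubChem CID found for ", query)
  cids
}

# Simplified, made-up responses for trying the parser offline
found   <- "<eSearchResult><IdList><Id>6434870</Id><Id>5282237</Id></IdList></eSearchResult>"
missing <- "<eSearchResult><IdList/></eSearchResult>"

extract_cids(found)
length(extract_cids(missing))
```

Returning `character(0)` instead of `list()` keeps the result type the same in both cases, so callers can simply test `length(cids) == 0`.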