xml, r, web-scraping, mechanize, doi

Given a table of citations, how can I look up the Digital Object Identifier (DOI) for each citation?


I have a table of citations that includes the last name of the first author, the title, journal, year, and page numbers for each citation.

I have posted the first few lines of the table on Google Docs; it is also available in the form of a CSV file. (Notice that some records do not have a DOI.)

I would like to be able to query the DOI for each of these citations. For the titles, it would be best if the query could handle some form of fuzzy matching.

How can I do this?

The table is currently in MySQL, but it would be sufficient to start and end with a CSV file or, since I mostly use R, an R data frame. (I would appreciate an answer that goes from start to finish.)


Solution

  • Here are two options

    CSV upload

    Another solution that looked promising, but did not work well in practice, is to upload the CSV directly and perform a text query at http://www.crossref.org/stqUpload/.

    Only 18 of the 250 queries (≈7%) returned a DOI.

    XML Query

    Based on the answer by Brian Diggs, here is an attempt in R that does about 95% of the work toward writing the XML-based query. The output still has a few problems that need to be cleaned up with sed. The biggest problem, however, is the “session timed out” errors I encountered when submitting the query.

    The XML syntax includes an option to use fuzzy matching.

    The doiquery.xml file contains the template text from Brian’s answer; the citations.csv file is linked above.

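    library(XML)

    # Parse the query template (the text from Brian's answer).
    doiquery.xml <- xmlTreeParse('doiquery.xml')

    # Node from the template that is filled in once per citation.
    query <- doiquery.xml$doc$children$query_batch[["body"]]

    # Read text columns as character rather than factor.
    citations <- read.csv("citations.csv", stringsAsFactors = FALSE)

    # Fill a copy of the template with the fields of one citation.
    # The argument is named `template` because a default of `query = query`
    # is a recursive default argument reference in R.
    new.query <- function(citation, template = query){
      xmlValue(template[["author"]]) <- as.character(citation$author)
      xmlValue(template[["year"]]) <- as.character(citation$year)
      xmlValue(template[["article_title"]][["text"]]) <- as.character(citation$title)
      xmlValue(template[["journal_title"]]) <- as.character(citation$journal)
      return(template)
    }

    # Collect one filled-in query per citation.
    # Assumption: the queries are gathered in a fresh <body> node.
    q <- xmlNode("body")
    for (i in seq_len(nrow(citations))){
      q <- addChildren(q, new.query(citations[i,]))
    }
    axml <- addChildren(doiquery.xml$doc$children$query_batch, q)

    saveXML(axml, file = 'foo.xml')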
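
Much of the sed clean-up mentioned above can probably be avoided by escaping special characters before the text goes into the XML. The following is a minimal base-R sketch of that idea, not the code I actually ran. The helper names `xml_escape` and `build_queries` are hypothetical, and the `key` and `match="fuzzy"` attributes and the element order are assumptions that should be checked against the template in doiquery.xml.

```r
# Escape characters that are special in XML text and attribute values.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run before the others
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

# Build one <query> string per row of the citation table.
# Assumption: attribute names and element order follow the query template;
# adjust them to match the doiquery.xml you are using.
build_queries <- function(citations) {
  sprintf(
    paste0(
      "<query key=\"row%d\">",
      "<article_title match=\"fuzzy\">%s</article_title>",
      "<author>%s</author>",
      "<journal_title>%s</journal_title>",
      "<year>%s</year>",
      "</query>"
    ),
    seq_len(nrow(citations)),
    xml_escape(as.character(citations$title)),
    xml_escape(as.character(citations$author)),
    xml_escape(as.character(citations$journal)),
    as.character(citations$year)
  )
}

# Made-up example row, only to show the escaping.
demo <- data.frame(
  author = "Smith", year = 2001,
  title = "Growth & yield of <i>Zea mays</i>",
  journal = "Ecology", stringsAsFactors = FALSE
)
queries <- build_queries(demo)
cat(queries, sep = "\n")
```

The resulting strings would still need to be wrapped in the head and body of the template before submission. The row number in `key` is there only so that results can be matched back to the table.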
    

    CSV to XML Converter

    Creativyst software provides a Web-based CSV to XML converter.

    The steps are as follows.

    1. Enter the column names in the ElementIDs field.
    2. Enter "document" in the DocID field.
    3. Enter "query" in the RowID field.
    4. Copy and paste the CSV file into the Input CSV file field.
    5. Click Convert.
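
The same conversion can be sketched locally in base R, which avoids pasting the file into a web form. This is only an illustration of the steps above, not the converter's own implementation. The function name `csv_to_xml` is hypothetical, the defaults mirror the "document" and "query" values entered above, and the column names are assumed to be valid XML element names.

```r
# Escape characters that are special in XML text.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run before the others
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Wrap each row in <row_id> and each cell in an element named after its column.
csv_to_xml <- function(df, doc_id = "document", row_id = "query") {
  rows <- vapply(seq_len(nrow(df)), function(i) {
    values <- vapply(df[i, , drop = FALSE], as.character, character(1))
    values[is.na(values)] <- ""  # e.g. records without a DOI
    fields <- sprintf("    <%s>%s</%s>", names(df), xml_escape(values), names(df))
    paste0("  <", row_id, ">\n", paste(fields, collapse = "\n"),
           "\n  </", row_id, ">")
  }, character(1))
  paste0("<", doc_id, ">\n", paste(rows, collapse = "\n"), "\n</", doc_id, ">")
}

# Made-up rows, only to show the output shape.
demo <- data.frame(
  author = c("Smith", "Lee"),
  year = c(2001, 1998),
  title = c("A & B", "C"),
  stringsAsFactors = FALSE
)
xml <- csv_to_xml(demo)
cat(xml, "\n")
```

For the real table, something like `writeLines(csv_to_xml(read.csv("citations.csv")), "citations.xml")` should write the result to a file.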

    See also a related question: Shell script to parse CSV to an XML query?