Tags: r, json, parsing, jsonlite, rjson

How can I parse both of these JSON files in R?


I am attempting to parse JSON files found on the web. From the same source, some JSONs can be parsed and others cannot. What is the difference between these JSONs? How can I parse them both? This is part of a larger script which loops through a list of jsons.

library(rjson)
library(jsonlite)

file1<-"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/24750/JSON/?response_type=display.json"
file2<-"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/177/JSON/?response_type=display.json"

validUTF8(file1) #TRUE
validUTF8(file2) #TRUE

parsed2<-rjson::fromJSON(file=file2) #outputs Large list
parsed1<-rjson::fromJSON(file=file1) #ERROR

Error in rjson::fromJSON(file = file1) :
attempt to set index 1/1 in SET_STRING_ELT

In addition: Warning message:
In readLines(file, warn = FALSE) : invalid input found on input connection
'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/24750/JSON/?response_type=display.json'

I tried switching to the jsonlite package without success.

parsed1<-jsonlite::fromJSON(txt=file1) #ERROR

Error in parse_con(txt, bigint_as_char) :
lexical error: invalid bytes in UTF8 string.
an average alkyl chain of 12�??14 carbon atoms, and an ethyl
(right here) ------^

I also tried readLines without success.

parsed1<-jsonlite::fromJSON(readLines(file1, warn=FALSE)) #ERROR

Error: parse error: premature EOF
{ "Record": { "RecordType
(right here) ------^

In addition: Warning message:
In readLines(request, warn = FALSE) : invalid input found on input connection
'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/24750/JSON/?response_type=display.json'


Solution

  • In parsed2<-jsonlite::fromJSON(readLines(file2, warn=FALSE)), readLines returns each line of the JSON text as a separate vector element. But fromJSON expects a single JSON string, hence the EOF error, so you need to collapse the separate strings produced by readLines. Now to your actual problem: jsonlite::fromJSON reports invalid bytes in a UTF-8 string, as nicely explained in Allan's answer. You can use the iconv trick: convert UTF-8 to UTF-8 and use the sub argument to replace every byte that fails to convert with " ", like so:

    read_broken_json <- function(url) {
      readLines(url, warn = FALSE, encoding = "UTF-8") |>
        paste(collapse = "") |>                             # join the lines into one string
        iconv(from = "UTF-8", to = "UTF-8", sub = " ") |>   # replace invalid bytes with " "
        rjson::fromJSON()
    }
    

    and then use it on your file links:

    res1 <- read_broken_json(file1)
    

    This replaces every invalid byte with " ", which you need to keep in mind when working with the parsed text. Note: both readLines and fromJSON accept URLs directly and retrieve the data themselves.
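
    To see the iconv step in isolation, here is a minimal base-R sketch that needs no network access. The byte 0x96 is only an assumed stand-in for whatever invalid bytes the PubChem response actually contains, and the variable names are made up for illustration. Note that validUTF8() checks the string it is given, so it has to be applied to the downloaded text rather than to the URL.

```r
# Build a short string containing a byte that is not valid UTF-8.
# 0x96 is a hypothetical example (an en dash in Windows-1252);
# the real offending bytes in the PubChem file may differ.
broken <- rawToChar(as.raw(c(0x31, 0x32, 0x96, 0x31, 0x34)))

# The raw text fails UTF-8 validation ...
is_valid_before <- validUTF8(broken)

# ... so convert UTF-8 to UTF-8 and substitute " " for each
# byte that cannot be converted.
fixed <- iconv(broken, from = "UTF-8", to = "UTF-8", sub = " ")

# The cleaned text should now pass validation and be safe to parse.
is_valid_after <- validUTF8(fixed)

print(is_valid_before)
print(is_valid_after)
```

    If losing characters is a concern, comparing the text before and after the conversion shows where substitutions happened.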