I am attempting to parse JSON files found on the web. From the same source, some JSONs can be parsed and others cannot. What is the difference between these JSONs? How can I parse them both? This is part of a larger script which loops through a list of jsons.
library(rjson)
library(jsonlite)
file1<-"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/24750/JSON/?response_type=display.json"
file2<-"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/177/JSON/?response_type=display.json"
validUTF8(file1) #TRUE
validUTF8(file2) #TRUE
parsed2<-rjson::fromJSON(file=file2) #outputs Large list
parsed1<-rjson::fromJSON(file=file1) #ERROR
Error in rjson::fromJSON(file = file1) :
attempt to set index 1/1 in SET_STRING_ELT
In addition: Warning message:
In readLines(file, warn = FALSE) : invalid input found on input connection
'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/24750/JSON/?response_type=display.json'
I tried switching to the jsonlite package without success.
parsed1<-jsonlite::fromJSON(txt=file1) #ERROR
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid bytes in UTF8 string.
an average alkyl chain of 12�??14 carbon atoms, and an ethyl
(right here) ------^
I also tried readLines without success.
parsed1<-jsonlite::fromJSON(readLines(file1, warn=FALSE)) #ERROR
Error: parse error: premature EOF
{ "Record": { "RecordType
(right here) ------^
In addition: Warning message:
In readLines(request, warn = FALSE) : invalid input found on input connection
'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/24750/JSON/?response_type=display.json'
In parsed2<-jsonlite::fromJSON(readLines(file2, warn=FALSE)), readLines reads every line of the JSON text as a separate vector element. But fromJSON expects a single JSON string, hence the EOF error. So you need to collapse the separate strings produced by readLines. Now to your problem: jsonlite::fromJSON says that there are invalid bytes in the UTF-8 string, as nicely explained in Allan's answer. You can use the iconv trick: convert from UTF-8 to UTF-8 and use the sub argument to replace every byte for which this fails with " ", like
read_broken_json <- function(url) {
  # Read all lines, join them into one string, replace invalid bytes, then parse
  readLines(url, warn = FALSE, encoding = "UTF-8") |>
    paste(collapse = "") |>
    iconv(from = "UTF-8", to = "UTF-8", sub = " ") |>
    rjson::fromJSON()
}
and then using it on your file links
res1 <- read_broken_json(file1)
Keep in mind that this replaces every invalid byte with " ", so the affected strings are altered. Note: both readLines and fromJSON accept URLs and retrieve the data themselves.
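To see what the iconv step does without touching the network, here is a minimal sketch on a hand-made string. The byte 0x96 and the "range" key are my own illustrative assumptions, not taken from the PubChem file, and the exact replacement behaviour of iconv can differ between platforms, so treat this as a demonstration rather than a guarantee.

```r
# Hypothetical sample: a small JSON string with one byte (0x96) that is
# not valid UTF-8 on its own. The real PubChem data may contain other bytes.
bad <- rawToChar(c(
  charToRaw('{"range":"12'),
  as.raw(0x96),
  charToRaw('14"}')
))

validUTF8(bad)    # FALSE: the string contains an invalid byte

# Same trick as in read_broken_json(): convert UTF-8 to UTF-8 and
# substitute a space wherever the conversion fails.
fixed <- iconv(bad, from = "UTF-8", to = "UTF-8", sub = " ")

validUTF8(fixed)  # should now be TRUE

# The cleaned string can then be parsed (assumes jsonlite is installed).
parsed <- jsonlite::fromJSON(fixed)
parsed$range
```

Comparing validUTF8() on the downloaded text before and after the iconv call is also a cheap way to find out which of your files were actually modified when you loop over the whole list.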