I have a fairly practical question for which it's hard to provide a reprex (sorry for that), so I'll try to explain it properly.
A script connects to an AWS S3 bucket with the aws.s3 package. The bucket contains .gz files that hold JSON. Unfortunately, some lines (not all) are malformed JSON: they end with }]]} instead of }]}.
So I am looking for a way in R to find and replace that pattern before parsing the JSON fails. The code below already contains a commented-out, non-working line (# gsub()) that shows my guess at a fix.
What would be your solution?
data_i <- aws.s3::get_object(
  object = objectname_i,
  bucket = bucketname_i,
  region = "eu-central-1",
  as = "raw"
) |>
  rawConnection() |>
  gzcon() |>
  # gsub("}]]}", "}]}") |>  # non-working guess: gsub() needs text, not a connection
  jsonlite::stream_in()
I found the following solution: after setting up the connection, I use gzcon() for decompression, as before. Then I read the lines over that connection with readLines(), so the data is in R as a character vector.
Now I can operate on that R object with gsub().
After that I want to use stream_in() again, so I open a textConnection() on the corrected lines. The result is the data.frame s3ObjectDataframe:
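# Download the object as raw bytes and wrap it in a decompressing connection
s3ObjectUnpacked <- aws.s3::get_object(
  object = objectname_i,
  bucket = bucketname_i,
  region = "eu-central-1",
  as = "raw"
) |>
  rawConnection() |>
  gzcon()
# Read the decompressed content, one JSON record per line
s3ObjectPerLine <- readLines(s3ObjectUnpacked)
close(s3ObjectUnpacked)
# fixed = TRUE treats the pattern as a literal string, not a regular expression
s3ObjectCorrected <- gsub("}]]}", "}]}", s3ObjectPerLine, fixed = TRUE)
s3ObjectDataframe <- jsonlite::stream_in(textConnection(gsub("\\n", "", s3ObjectCorrected)))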
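For anyone who wants to try the approach without access to the bucket, here is a minimal sketch that replaces the S3 download with a small gzip file built locally. The two sample records and all variable names are made up for illustration, and it assumes the stray bracket only ever appears at the very end of a line, so it repairs just that suffix instead of every occurrence in the line.

```r
library(jsonlite)

# Hypothetical sample data: the second line ends with the faulty "}]]}"
sample_lines <- c(
  '{"id":1,"tags":[{"k":"a"}]}',
  '{"id":2,"tags":[{"k":"b"}]]}'
)

# Write a gzip file and read it back as raw bytes; this stands in for
# aws.s3::get_object(..., as = "raw")
tmp <- tempfile(fileext = ".gz")
out_con <- gzfile(tmp, "w")
writeLines(sample_lines, out_con)
close(out_con)
raw_gz <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Same steps as above: raw bytes -> decompressing connection -> lines
gz_con <- gzcon(rawConnection(raw_gz))
json_lines <- readLines(gz_con)
close(gz_con)

# Repair only lines that END with "}]]}" (assumption: the bug is a suffix)
bad <- endsWith(json_lines, "}]]}")
json_lines[bad] <- paste0(
  substr(json_lines[bad], 1, nchar(json_lines[bad]) - 4),
  "}]}"
)

# Parse the corrected lines as newline-delimited JSON
txt_con <- textConnection(json_lines)
df <- jsonlite::stream_in(txt_con, verbose = FALSE)
close(txt_con)
```

Checking the suffix with endsWith() avoids any regex escaping questions and leaves a "}]]}" that might legitimately occur in the middle of a record untouched.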