I have a column with hundreds of millions of rows. A sample row would be:
x <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"
# use cat(x) to print
I would like to transform the column so that each value looks like this (again as printed by cat()):
{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}
so that it can be fed into the code below:
y <- sprintf("[%s]", toString(x)) |> # wrap in square brackets
  jsonlite::fromJSON()
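As a quick sanity check, the hand-formatted target above does round-trip through that pipeline (the variable name target is just illustrative):
library(jsonlite)
# hand-written example of the desired shape
target <- '{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}'
sprintf("[%s]", toString(target)) |>
  fromJSON()
# returns a 1-row data frame with columns: some field, y, url, source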
Given the untidy (semi-structured) nature of the column, what is the simplest (and most performant) regex, presumably using gsub(), to achieve this?
Edit:
Two more rows of real-world data:
library(data.table)

x <- data.table(id = c(1, 2, 3)
, string = c('{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}'
, '{url=https://www.gov.uk/government/publications/damp-and-mould-understanding-and-addressing-the-health-risks-for-rented-housing-providers/understanding-and-addressing-the-health-risks-of-damp-and-mould-in-the-home--2#:~:text=protect%20tenant%20health.-,Respiratory%20effects,wheeze%20and%20shortness%20of%20breath&text=increased%20risk%20of%20airway%20infections,airways%20with%20the%20fungus%20Aspergillus)&text=development%20or%20worsening%20of%20allergic,obstructive%20pulmonary%20disease%20(%20COPD%20))}'
, '{x = 292, y = 1029, url = https://go.skimresources.com?id=76202X1528716&xs=1&url=https%3A%2F%2Fwww.marksandspencer.com%2Fbuckle-detail-faux-fur-jacket%2Fp%2Fclp60700989%23intid%3Dpid_pg1pip48g4r3c2&sref=https%3A%2F%2Fwww.liverpoolecho.co.uk%2Fwhats-on%2Fshopping%2Fmarks--spencers-best-fur-30733210}'
)
)
I would like to end up with a new data frame cbind-ed to the original, with one column per distinct key found in the string column.
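For what it's worth, once each row is valid JSON, jsonlite::fromJSON() already handles the "one column per distinct key" part when given an array of objects, filling missing keys with NA; a minimal illustration with hand-written input:
library(jsonlite)
# rows with different key sets -> one column per distinct key
fromJSON('[{"x":"292","url":"a"}, {"url":"b","source":"c"}]')
# returns a 2-row data frame with columns x, url, source;
# the cells for keys absent from a row come back as NA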
gsub() is reasonably fast:
# repeat the sample row 1 million times for testing
x <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 1000000)
reform <- function(x) {
  # strip whitespace around {, } and , then quote each key and value
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
}
But stringr::str_replace_all() is faster:
library(stringr)
reform_stringr <- function(x) {
  # same two substitutions, done with stringr
  x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])', '\\1\\2')
  str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
}
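The timings below are from microbenchmark; the exact call isn't shown here, but it would be along these lines:
library(microbenchmark)
# time both versions on the 1-million-element vector, 10 runs each
microbenchmark(
  reform = reform(x),
  reform_stringr = reform_stringr(x),
  times = 10
)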
Unit: seconds
expr min lq mean median uq max neval cld
reform 6.168293 6.177294 6.236553 6.211335 6.265154 6.379521 10 a
reform_stringr 3.974893 3.990187 3.997749 3.994628 3.997775 4.040163 10 b
y <- sprintf("[%s]", toString(reform_stringr(x))) |>
  jsonlite::fromJSON()
> head(y)
| some field | y      | url                   | source |
|------------|--------|-----------------------|--------|
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
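To get the cbind-ed result described in the edit, the same pipeline can be applied to the string column and the parsed columns bound back onto the original table. A minimal sketch, assuming x is the data.table from the edit (not the repeated benchmark vector) and that the regex copes with every row (the real URLs contain their own '=' and ',' characters, so that is not guaranteed):
# parse the string column into a data frame with one column per distinct key
parsed <- sprintf("[%s]", toString(reform_stringr(x$string))) |>
  jsonlite::fromJSON()
# bind the parsed columns next to the original data.table
result <- cbind(x, parsed)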