I have a column with hundreds of millions of rows. A sample row would be:
x <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"
# use cat(x) to print
I would like to transform the column so that each value looks like this (again as printed by cat()):
{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}
so that it can be fed into the code below:
y <- sprintf("[%s]", toString(x)) |> # wrap in square brackets
  jsonlite::fromJSON()
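As a quick sanity check, the hand-formatted target above does round-trip through that pipeline (the variable name target is just illustrative):
library(jsonlite)
# hand-written example of the desired shape
target <- '{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}'
sprintf("[%s]", toString(target)) |>
  fromJSON()
# returns a 1-row data frame with columns: some field, y, url, source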
Given the untidy (semi-structured) nature of the column, what is the simplest (and most performant) regex, presumably using gsub(), to achieve this?
Edit:
Two more rows of real-world data:
library(data.table)

x <- data.table(id = c(1, 2, 3)
, string = c('{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}'
, '{url=https://www.gov.uk/government/publications/damp-and-mould-understanding-and-addressing-the-health-risks-for-rented-housing-providers/understanding-and-addressing-the-health-risks-of-damp-and-mould-in-the-home--2#:~:text=protect%20tenant%20health.-,Respiratory%20effects,wheeze%20and%20shortness%20of%20breath&text=increased%20risk%20of%20airway%20infections,airways%20with%20the%20fungus%20Aspergillus)&text=development%20or%20worsening%20of%20allergic,obstructive%20pulmonary%20disease%20(%20COPD%20))}'
, '{x = 292, y = 1029, url = https://go.skimresources.com?id=76202X1528716&xs=1&url=https%3A%2F%2Fwww.marksandspencer.com%2Fbuckle-detail-faux-fur-jacket%2Fp%2Fclp60700989%23intid%3Dpid_pg1pip48g4r3c2&sref=https%3A%2F%2Fwww.liverpoolecho.co.uk%2Fwhats-on%2Fshopping%2Fmarks--spencers-best-fur-30733210}'
)
)
I would like to end up with a new data frame cbind-ed to the original, with one column per distinct key found in the string column.
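For what it's worth, once each row is valid JSON, jsonlite::fromJSON() already handles the "one column per distinct key" part when given an array of objects, filling missing keys with NA; a minimal illustration with hand-written input:
library(jsonlite)
# rows with different key sets -> one column per distinct key
fromJSON('[{"x":"292","url":"a"}, {"url":"b","source":"c"}]')
# returns a 2-row data frame with columns x, url, source;
# the cells for keys absent from a row come back as NA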
gsub() is reasonably fast:
# repeat the sample row 1 million times for testing
x <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 1000000)
reform <- function(x) {
  # strip whitespace around {, } and , then quote each key and value
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
}
But stringr::str_replace_all() is faster:
library(stringr)
reform_stringr <- function(x) {
  # same two substitutions, done with stringr
  x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])', '\\1\\2')
  str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
}
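The timings below are from microbenchmark; the exact call isn't shown here, but it would be along these lines:
library(microbenchmark)
# time both versions on the 1-million-element vector, 10 runs each
microbenchmark(
  reform = reform(x),
  reform_stringr = reform_stringr(x),
  times = 10
)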
Unit: seconds
expr min lq mean median uq max neval cld
reform 6.168293 6.177294 6.236553 6.211335 6.265154 6.379521 10 a
reform_stringr 3.974893 3.990187 3.997749 3.994628 3.997775 4.040163 10 b
y <- sprintf("[%s]", toString(reform_stringr(x))) |>
  jsonlite::fromJSON()
> head(y)
| some field | y      | url                   | source |
|------------|--------|-----------------------|--------|
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
| 249 apples | m s 33 | https://go.s.com?id=7 | multiC |
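To get the cbind-ed result described in the edit, the same pipeline can be applied to the string column and the parsed columns bound back onto the original table. A minimal sketch, assuming x is the data.table from the edit (not the repeated benchmark vector) and that the regex copes with every row (the real URLs contain their own '=' and ',' characters, so that is not guaranteed):
# parse the string column into a data frame with one column per distinct key
parsed <- sprintf("[%s]", toString(reform_stringr(x$string))) |>
  jsonlite::fromJSON()
# bind the parsed columns next to the original data.table
result <- cbind(x, parsed)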