rjsonperformance

Formatting for JSON lite


I have a column with hundreds of millions of rows. A sample row would be:

x <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"

# use cat(x) to print

I would like to get the column to be ready so it looks like below:

"{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}"

to be fed into below code:

y <- sprintf("[%s]"             # wrap in square brackets
               , toString(x)
               ) |>
          jsonlite::fromJSON()

Given the untidy (semi-structure) nature of the column what is the simplest (and performant) regex (supposedly using gsub) to achieve this?

Edit:

2 more rows of real world data

x <- data.table(id = c(1,2,3)
                , string = c('{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}'
                             , '{url=https://www.gov.uk/government/publications/damp-and-mould-understanding-and-addressing-the-health-risks-for-rented-housing-providers/understanding-and-addressing-the-health-risks-of-damp-and-mould-in-the-home--2#:~:text=protect%20tenant%20health.-,Respiratory%20effects,wheeze%20and%20shortness%20of%20breath&text=increased%20risk%20of%20airway%20infections,airways%20with%20the%20fungus%20Aspergillus)&text=development%20or%20worsening%20of%20allergic,obstructive%20pulmonary%20disease%20(%20COPD%20))}'
                             , '{x = 292, y = 1029, url = https://go.skimresources.com?id=76202X1528716&xs=1&url=https%3A%2F%2Fwww.marksandspencer.com%2Fbuckle-detail-faux-fur-jacket%2Fp%2Fclp60700989%23intid%3Dpid_pg1pip48g4r3c2&sref=https%3A%2F%2Fwww.liverpoolecho.co.uk%2Fwhats-on%2Fshopping%2Fmarks--spencers-best-fur-30733210}'
                             )
                )

I would like to end up with a new dataframe cbind to the original. This new data frame would have no. of columns equal to the no. of distinct keys inside the JSON column string.


Solution

  • Gsub is sort of fast:

    # one row
    x <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 1000000) # repeat for 1 mil rows for testing
    
    reform <- function(x) {
      gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"', 
           gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
    }
    

    But stringr::str_replace_all is faster:

    library(stringr)
    
    reform_stringr <- function(x) {
      x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])', '\\1\\2')
      str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
    }
    

    Test Results for 1 mil rows

    Unit: seconds
               expr      min       lq     mean   median       uq      max neval cld
             reform 6.168293 6.177294 6.236553 6.211335 6.265154 6.379521    10  a 
     reform_stringr 3.974893 3.990187 3.997749 3.994628 3.997775 4.040163    10   b
    

    Use it like this

    y <- sprintf("[%s]", toString(reform_stringr(x))) |>
      jsonlite::fromJSON()
    
    > head(y)
    
    some field y url source
    249 apples m s 33 https://go.s.com?id=7 multiC
    249 apples m s 33 https://go.s.com?id=7 multiC
    249 apples m s 33 https://go.s.com?id=7 multiC
    249 apples m s 33 https://go.s.com?id=7 multiC
    249 apples m s 33 https://go.s.com?id=7 multiC
    249 apples m s 33 https://go.s.com?id=7 multiC