rjsontidyrunnest

how to unnest json string from data frame in R?


Hi and thanks for reading me. I have a data frame that have the next structure:

# A tibble: 6 × 2
  customerId transactionHistory                                                                                          
       <dbl> <chr>                                                                                                       
1      40895 [{'transactionId': 0, 'date': '2024-05-04', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'trans…
2      42966 [{'transactionId': 0, 'date': '2024-05-21', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'trans…
3      51806 [{'transactionId': 0, 'date': '2024-04-22', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'trans…
4      53366 [{'transactionId': 0, 'date': '2024-05-23', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'trans…
5      53632 [{'transactionId': 0, 'date': '2024-04-24', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'trans…
6      53871 [{'transactionId': 0, 'date': '2024-04-18', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}]   

And I want to unnest the data to be able on get a bigger table with all the transactions in a properly data frame format, but I dont be able to find a way to do that, because the transactionHistory column is a string, not a json.

Does anyone knows a way to do that?


Solution

  • Here is one possible approach which first replace single quotes in the JSON string by double quotes (see Single vs double quotes in JSON), then uses lapply and jsonlite::fromJSON to convert to a data.frame and finally unnests the column of transaction histories.

    Using a minified reproducible example:

    dat <- tibble::tibble(
      customerId = c(40895, 42966),
      transactionHistory = c(
        "[{'transactionId': 0, 'date': '2024-05-04', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'transactionId': 1, 'date': '2024-05-04', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}]",
        "[{'transactionId': 2, 'date': '2024-05-04', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}, {'transactionId': 3, 'date': '2024-05-04', 'status': 'SUCCESSFUL', 'transactionAmount': '$40.00'}]"
      )
    )
    
    dat
    #> # A tibble: 2 × 2
    #>   customerId transactionHistory                                                 
    #>        <dbl> <chr>                                                              
    #> 1      40895 [{'transactionId': 0, 'date': '2024-05-04', 'status': 'SUCCESSFUL'…
    #> 2      42966 [{'transactionId': 2, 'date': '2024-05-04', 'status': 'SUCCESSFUL'…
    
    library(dplyr, warn = FALSE)
    library(tidyr)
    library(jsonlite)
    
    dat |>
      mutate(
        # Replace ' by \"
        transactionHistory = gsub("\\'", "\"", transactionHistory),
        # Convert to dataframe
        transactionHistory = lapply(transactionHistory, jsonlite::fromJSON)
      ) |>
      tidyr::unnest(transactionHistory)
    #> # A tibble: 4 × 5
    #>   customerId transactionId date       status     transactionAmount
    #>        <dbl>         <int> <chr>      <chr>      <chr>            
    #> 1      40895             0 2024-05-04 SUCCESSFUL $40.00           
    #> 2      40895             1 2024-05-04 SUCCESSFUL $40.00           
    #> 3      42966             2 2024-05-04 SUCCESSFUL $40.00           
    #> 4      42966             3 2024-05-04 SUCCESSFUL $40.00