rdataframedplyrserializationduckdb

Serialize/deserialize a column with R and DuckDB


tldr:

I finally found the solution for my specific case thanks to @r2evans answer below:

When writing the table:

dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))

When reading the table:

dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))


I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).

I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.

Reproducible example:

df <- tibble::tibble(x = 1:5, y = letters[1:5])

# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))

# A tibble: 5 × 2
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e  
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

duckdb::dbWriteTable(conn, "df", input)

output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()

# A tibble: 5 × 2    
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e   

The same (serialized) data is returned from DuckDB, however I cannot deserialize it:

output |> dplyr::mutate(x = lapply(x, unserialize))

# A tibble: 5 × 2
  x         y    
  <list>    <chr>
1 <int [1]> a    
2 <int [1]> b    
3 <int [1]> c    
4 <int [1]> d    
5 <int [1]> e  

Any help would be greatly appreciated!

Edit:

The actual data contained in the column that I serialize is essentially a html string, in fact it shows:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <html> a    
2 <html> b    
3 <html> c    
4 <html> d    
5 <html> e  

After running dplyr::mutate(x = lapply(x, unserialize)), I get back:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <raw>  a    
2 <raw>  b    
3 <raw>  c    
4 <raw>  d     
5 <raw>  e  

Further info:

This is the data before and after:

> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>

> after[1]
[[1]]
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
 [ reached getOption("max.print") -- omitted 166065 entries ]

> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html"      "character"

> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"

I have tried all of the below suggestions with no luck. I consistently end up with <raw> after deserialising.

Additionally:

> unserialize(after[1][[1]])
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
  ...

Solution

  • This is just about "list-columns". Usually frames' columns are vectors of simple scalars (perhaps an over-reduction). However, it is certainly possible to include more complex objects in each "cell" of a column, this is done via "list columns", where instead of a vector (in which all objects must be the same class), we use a list (in which each cell may be length-0, length-1, more complex objects, just about anything).

    If you are certain that all data in output$x reduces to a length-1 scalar, then you may be able to replace lapply (which always returns a list) with sapply:

    output |> dplyr::mutate(x = sapply(x, unserialize))
    # # A tibble: 5 × 2
    #       x y    
    #   <int> <chr>
    # 1     1 a    
    # 2     2 b    
    # 3     3 c    
    # 4     4 d    
    # 5     5 e    
    

    This is not always safe, however. For a demonstration of when sapply may not be desired,

    now <- Sys.time()
    now + 0
    # [1] "2025-03-06 14:45:43 EST"
    sapply(now, `+`, 0)
    # [1] 1741290344
    do.call(c, lapply(now, `+`, 0))
    # [1] "2025-03-06 14:45:43 EST"
    

    So if you are certain that all the data in output$x reduces to a length-1 scalar and you may have attributes that sapply can dump, do this:

    output |> dplyr::mutate(x = do.call(c, lapply(x, unserialize)))
    

    If any of output$x may be length greater than 1, then consider tidyr::unnest:

    output |>
      dplyr::mutate(x = lapply(x, unserialize)) |>
      unnest(cols = x)
    

    Note that if any of output$x may be length-0, NULL, or similar, then the rows that contain those empty x values will be removed in the results. For a demonstration, see

    set.seed(43)
    output |>
      dplyr::mutate(x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z))) |>
      tidyr::unnest(cols = x)
    # # A tibble: 3 × 2
    #       x y    
    #   <int> <chr>
    # 1     1 a    
    # 2     3 c    
    # 3     5 e    
    

    To better deal with those, I suggest something like this:

    set.seed(43)
    output |>
      dplyr::mutate(
        x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z)),
        # additional step to convert length-0 to length-1 NA
        x = lapply(x, function(z) if (is.null(z)) NA else z)
      ) |>
      tidyr::unnest(cols = x)
    # # A tibble: 5 × 2
    #       x y    
    #   <int> <chr>
    # 1     1 a    
    # 2    NA b    
    # 3     3 c    
    # 4    NA d    
    # 5     5 e