tldr:
I finally found the solution for my specific case thanks to @r2evans answer below:
When writing the table:
dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))
When reading the table:
dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))
I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).
I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.
Reproducible example:
df <- tibble::tibble(x = 1:5, y = letters[1:5])
# A tibble: 5 × 2
x y
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw [35]> a
2 <raw [35]> b
3 <raw [35]> c
4 <raw [35]> d
5 <raw [35]> e
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")
duckdb::dbWriteTable(conn, "df", input)
output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw [35]> a
2 <raw [35]> b
3 <raw [35]> c
4 <raw [35]> d
5 <raw [35]> e
The same (serialized) data is returned from DuckDB, however I cannot deserialize it:
output |> dplyr::mutate(x = lapply(x, unserialize))
# A tibble: 5 × 2
x y
<list> <chr>
1 <int [1]> a
2 <int [1]> b
3 <int [1]> c
4 <int [1]> d
5 <int [1]> e
Any help would be greatly appreciated!
Edit:
The actual data contained in the column that I serialize is essentially a html string, in fact it shows:
# A tibble: 5 × 2
x y
<list> <chr>
1 <html> a
2 <html> b
3 <html> c
4 <html> d
5 <html> e
After running dplyr::mutate(x = lapply(x, unserialize))
, I get back:
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw> a
2 <raw> b
3 <raw> c
4 <raw> d
5 <raw> e
Further info:
This is the data before and after:
> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>
> after[1]
[[1]]
[1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
[ reached getOption("max.print") -- omitted 166065 entries ]
> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html" "character"
> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"
I have tried all of the below suggestions with no luck. I consistently end up with <raw>
after deserialising.
Additionally:
> unserialize(after[1][[1]])
[1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
...
This is just about "list-columns". Usually frames' columns are vectors of simple scalars (perhaps an over-reduction). However, it is certainly possible to include more complex objects in each "cell" of a column, this is done via "list columns", where instead of a vector (in which all objects must be the same class), we use a list (in which each cell may be length-0, length-1, more complex objects, just about anything).
If you are certain that all data in output$x
reduces to a length-1 scalar, then you may be able to replace lapply
(which always returns a list
) with sapply
:
output |> dplyr::mutate(x = sapply(x, unserialize))
# # A tibble: 5 × 2
# x y
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
This is not always safe, however. For a demonstration of when sapply
may not be desired,
now <- Sys.time()
now + 0
# [1] "2025-03-06 14:45:43 EST"
sapply(now, `+`, 0)
# [1] 1741290344
do.call(c, lapply(now, `+`, 0))
# [1] "2025-03-06 14:45:43 EST"
So if you are certain that all the data in output$x
reduces to a length-1 scalar and you may have attributes that sapply
can dump, do this:
output |> dplyr::mutate(x = do.call(c, lapply(x, unserialize)))
If any of output$x
may be length greater than 1, then consider tidyr::unnest
:
output |>
dplyr::mutate(x = lapply(x, unserialize)) |>
unnest(cols = x)
Note that if any of output$x
may be length-0, NULL
, or similar, then the rows that contain those empty x
values will be removed in the results. For a demonstration, see
set.seed(43)
output |>
dplyr::mutate(x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z))) |>
tidyr::unnest(cols = x)
# # A tibble: 3 × 2
# x y
# <int> <chr>
# 1 1 a
# 2 3 c
# 3 5 e
To better deal with those, I suggest something like this:
set.seed(43)
output |>
dplyr::mutate(
x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z)),
# additional step to convert length-0 to length-1 NA
x = lapply(x, function(z) if (is.null(z)) NA else z)
) |>
tidyr::unnest(cols = x)
# # A tibble: 5 × 2
# x y
# <int> <chr>
# 1 1 a
# 2 NA b
# 3 3 c
# 4 NA d
# 5 5 e