I have some very large datasets for which I would like to avoid storing duplicated rows. My idea was to create a hash for each row and only store it if it doesn't exist yet. Admittedly, I know very little about hashing other than that it exists. I looked around and found that digest seems to do what I want. However, I am having an issue with getting the hashed values to match when using apply over a data.table. I have a feeling this is related to how the rows are passed in via apply, but I cannot come up with a solution.
Here is a simple example:
library(data.table)
library(digest)
set.seed(123)
x = data.table(col1 = c("A", "B", "C"), col2 = 1:3)
# hash each row via apply
hash = apply(x, 1, digest)
hash
# "9608821bb8a76e3f7b0798ebc2160258" "7dbcdb0882f23925d73c07f6711b0891" "ad79fca9c97c66c37f897528d3996bc6"
# hash the first row on its own -- gives a different value
digest(x[1])
# "1558712ec6c0b7bef303190f0ce80e63"
I have tried concatenating the row before hashing, as well as messing around with the serialize argument to digest, but nothing gets me what I want. How can I efficiently hash each row of a data.table and arrive at the same value as when I hash a row on its own?
The reason you get different hashes is that digest receives a named character vector from apply rather than a one-row data.table: apply coerces x to a character matrix before splitting it into rows.
apply(x, 1, str)
# Named chr [1:2] "A" "1"
# - attr(*, "names")= chr [1:2] "col1" "col2"
# Named chr [1:2] "B" "2"
# - attr(*, "names")= chr [1:2] "col1" "col2"
# Named chr [1:2] "C" "3"
# - attr(*, "names")= chr [1:2] "col1" "col2"
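To confirm the diagnosis, you can hash the coerced first row by hand; if this is indeed what apply passed along, it should reproduce the first value from the question:
digest(c(col1 = "A", col2 = "1"))  # should equal hash[[1]]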
You can use this code instead to get your hashes:
# hash each one-row subset, so each value matches digest(x[i])
hash = sapply(seq_len(nrow(x)), \(i) digest(x[i]))
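With per-row hashes that match digest(x[i]), the deduplication goal from the question is then a one-liner; a minimal sketch (x_unique is just an illustrative name):
# keep only the first occurrence of each row hash
x_unique = x[!duplicated(hash)]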
If hashing is slow, you can instead compare rows directly to check whether a candidate row already exists in your dataset:
v = x[1]
# TRUE if any row of x is identical to the candidate row v
sapply(seq_len(nrow(x)), \(i) identical(v, x[i])) |> any()
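Note that identical() is strict about column types, so a candidate row must match them exactly; a quick sketch with a row that is not in x:
# col2 must be integer (4L, not 4) or identical() can never be TRUE
v = data.table(col1 = "D", col2 = 4L)
sapply(seq_len(nrow(x)), \(i) identical(v, x[i])) |> any()
# FALSE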