
R: Hashing rows of data.table


I have some very large datasets for which I would like to avoid storing duplicated rows. My idea was to compute a hash for each row and only store the row if its hash doesn't exist yet. Admittedly, I know very little about hashing other than that it exists. I looked around and found that digest seems to do what I want. However, I cannot get the hashed values to match when using apply over a data.table. I suspect this is related to how the rows are passed in by apply, but I cannot come up with a solution.

Here is a simple example:

library(data.table)
library(digest)

set.seed(123)

x = data.table(col1 = c("A", "B", "C"), col2 = 1:3)
hash = apply(x, 1, digest)
hash
# "9608821bb8a76e3f7b0798ebc2160258" "7dbcdb0882f23925d73c07f6711b0891" "ad79fca9c97c66c37f897528d3996bc6"

digest(x[1])
# "1558712ec6c0b7bef303190f0ce80e63"

I have tried concatenating the row before hashing, as well as messing around with the serialize argument to digest, but nothing gets me what I want. How can I efficiently hash each row of a data.table and arrive at the same value as when I hash a row on its own?


Solution

  • You get different hashes because apply() coerces the data.table to a character matrix (via as.matrix()) and passes digest() a named character vector for each row, not a one-row data.table:

    apply(x, 1, str)
    # Named chr [1:2] "A" "1"
    # - attr(*, "names")= chr [1:2] "col1" "col2"
    # Named chr [1:2] "B" "2"
    # - attr(*, "names")= chr [1:2] "col1" "col2"
    # Named chr [1:2] "C" "3"
    # - attr(*, "names")= chr [1:2] "col1" "col2"
    

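To see the coercion directly, here is a minimal sketch reusing the x from the question; apply() goes through as.matrix(), which turns the integer column into character before digest() ever sees it:

```r
library(data.table)
library(digest)

x = data.table(col1 = c("A", "B", "C"), col2 = 1:3)

# as.matrix() on a mixed-type data.table yields a character matrix,
# so col2's integers become the strings "1", "2", "3"
row1 = as.matrix(x)[1, ]   # a named character vector, as passed by apply()
class(row1)                # "character"

# a character vector serializes differently from a one-row data.table,
# so the two hashes cannot match
identical(digest(row1), digest(x[1]))  # FALSE
```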
    Instead, subset by row index so that digest() receives a one-row data.table each time:

    hash = sapply(seq_len(nrow(x)), \(i) digest(x[i]))
    

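As a quick sanity check (same x as in the question), the per-row hashes computed this way agree with hashing a row on its own:

```r
library(data.table)
library(digest)

x = data.table(col1 = c("A", "B", "C"), col2 = 1:3)

# hash each row as a one-row data.table, matching digest() on x[i]
hash = sapply(seq_len(nrow(x)), \(i) digest(x[i]))

identical(hash[1], digest(x[1]))  # TRUE
```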
    If hashing turns out to be slow, you can instead compare a candidate row directly against the rows already in your dataset:

    v = x[1]
    sapply(seq_len(nrow(x)), \(i) identical(v, x[i])) |> any()
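If you do go the hashing route for deduplication, a membership check against the stored hashes is straightforward. A small sketch (the variable names seen and candidate are made up for illustration):

```r
library(data.table)
library(digest)

x = data.table(col1 = c("A", "B", "C"), col2 = 1:3)

# hashes of the rows already stored
seen = sapply(seq_len(nrow(x)), \(i) digest(x[i]))

candidate = x[2]               # a row we already have
digest(candidate) %in% seen    # TRUE, so skip storing it
```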