rdata-maskinganonymize

Data masking in relational table using R


I am trying to mask data in such a way that referential integrity is not compromised.

My table Customer has this data:

Customer table

Customer_ID | Customer_Name | Address                                     | Phone        | Product_ID
143         |  Mandy Rutter | 2465 Crim Lane. Shawneetown , New York      | 718-530-0536 | 995
142         | Rich Raymond  | 2150 Linda Street ,Eagleville ,Pennsylvania | 267-902-9185 | 876

Product Table

Product_Seq | Product_Name
995         | T-Shirt
876         | Blazer

Now I want to mask the Product_ID ( say) in customer table . When I mask it the transformation is 995 => xyz and 876 = > pqr

Now the same masking should be applied to Product table otherwise I wont know the product_Name for a particular customer_id.

I tried using digest library based solution. It works if I mask alphanumeric, however for integers it NA.

For masking/anonymizing it uses following code:

anonymize <- function(x, algo="crc32"){
  unq_hashes <- vapply(unique(x), function(object) digest(object, algo=algo), FUN.VALUE="", USE.NAMES=TRUE)
  unname(unq_hashes[x])
}

Can you please let me know why is that? Is there any recent packages for masking which would maintain the referential integrity while doing so?

The other package i thought off was anonymizer however it looks like it treats only a few columns as sensitive and it converts numbers to alphanumeric while masking.

I am trying to copy data from one DBMS system ( a few related tables) , anonymize the data on some of the columns while maintaining the relation and put in into other DBMS system.


Solution

  • There are two things that seem important for you

    1. anonymity
    2. referential integrity

    For both of your requirements that solution from the blog article you linked is a bad choice.

    Anonymity

    Just hashing does not provide anonymity. The article also mentions (but it is not in the code) you probably at least want to add a salt.

    Just an example:

    A number like 211 will be af9fad5f as a CRC32 hash. If the person you share your data with sees this 8char(32bit) alphanumeric string it probably will assume that this might be a CRC32 hash. The good thing with hashes is you can not easily calculate back starting from af9fad5f to 211. The bad thing is, most short words/ hashes are already precalculated and easy to look up in what is called a rainbow table (e.g. https://md5hashing.net/hash/crc32/af9fad5f).

    This basically means everybody could just look up the "clear text" behind the crc32 hashes. (same for all other hashes). Adding a salt prevents this. (this salt must of course be kept secret!)

    Referential Integrity The referential integrity is kept. 211 will be always be af9fad5f as a CRC32 hash - this is static and there is no random effect to it. So the Product_ID would stay the same for all your tables. Which is what you need.

    But just to be sure I would use SHA256 instead of CRC32. In CRC32 everything will be mapped to a 8chars alphanumeric (32bit). If you have quite a lot of data - there is some chance of hash collisions. This means two numbers/ids in the same table actually having the same hash. With SHA256 this is next to impossible.

    Overall I think using the anonymizer package seems ok. (it is not actively maintained - but functionality seems to be ok)

    install.packages("devtools")
    devtools::install_github("paulhendricks/anonymizer")
    # Some test data
    testdata <- data.frame(t1 = c(211,11,9), t2 = c(2,3,9))
    anonymizer::anonymize(testdata$t1, .algo = "sha256", .seed = 1)
    anonymizer::anonymize(testdata$t2, .algo = "sha256", .seed = 1)
    

    t1

    [1] "9ebb37da5a7e1db2a0ff7d0e9aa6df5b6d27a5928ce0454a9e71655cf5a16e46"
    [2] "ec3e0f47c01a40969f933b9179edee184bc2d57a77e1941fc1fe773c8ac429b6"
    [3] "14439a3c932d0f133ac5f5f9b147be16b4ffd854664b58fcae922084da984e2e"
    

    t2

    [1] "03a5ed142ba102af5c8e4328f54c8785310b7a8f1881b3bb9d1803261b64e91a"
    [2] "0b65dd05edbd9154c3668ce294ca875e1d079967051c06238516d1a0bb233b7c"
    [3] "14439a3c932d0f133ac5f5f9b147be16b4ffd854664b58fcae922084da984e2e"
    

    As you the last hash of testdata$t1 and testdata$t2 is the same, because it is both times 9.

    The other package I thought off was anonymizer however it looks like it treats only a few columns as sensitive and it converts numbers to alphanumeric while masking.

    How do you mean this? Do you want your values masked by numerical values ? (for this you could in R convert the string in a numerical value). But at least for product ids I don't really see the benefit.

    Addition from comments:

    @Steffen , can we somehow replace numeric values with numbers only. For example SSN is 9 digits and I want to replace it another number of 9 digits only.

    Yes, but not in a nice way with the hashing based solutions outlined above. Would probably require you to write some more code. Which needs to be done very carefully, because it is quite easy to miss something while ensuring referential integrity, avoiding collusions and preventing re-identification). I would really try to go with a existing solution.

    You could also take a look at the sdcMicro package in R. I don't think they have something for the referential integrity part. But you might be able to circumvent this by joining tables then using a anonymization function (this way e.g. product_id would only be in one column) and then splitting to two datasets again.

    I think it might be also a good idea to use another tool to do the anonymization - take a look at this open source tool ARX – Data Anonymization Tool

    Downside of an external GUI tool like this would be, that it is probably not a good idea, if you plan to create the anonymized datasets more than just once. Since you don't have a program, which you can just run.