Tags: csv, bigdata, anonymize

Anonymization of Account Numbers in 2 TB of CSVs


I have ~2 TB of CSVs where the first two columns contain two ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.

The Question:

Standard hashing algorithms produce long strings, but I will have to do a lot of ID-matching (i.e., 'for the subset of rows containing ID XXX, do ...') to process the anonymized data, so long hashes are not ideal. Is there a better way?

For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10 million] as replacement/anonymized IDs?

The computational constraint is that the data will likely be anonymized on a 32-core server with ~500 GB of memory.


Solution

  • I will assume that you want to make a single pass: one CSV with ID numbers as input, another CSV with anonymized numbers as output. I will also assume the number of unique IDs is on the order of 10 million or fewer.

    I think it would be best to use a totally arbitrary one-to-one function from the set of ID numbers (N) to the set of de-identified numbers (D). This is more secure than hashing: if you used some sort of hash function and an adversary learned which hash it was, the numbers in N could be recovered without much trouble by a dictionary attack. Instead I suggest a simple lookup table: ID 1234567 maps to de-identified number 4672592, and so on. The correspondence would be stored in another file, and an adversary without that file would not be able to do much.

    With 10 million or fewer unique IDs, on a machine such as you describe, this is not a big problem. A sketch program in Python:

    import csv
    import random

    mapping = {}
    unused_numbers = list(range(10_000_000))
    random.shuffle(unused_numbers)  # pop() then yields a random unused number in O(1)

    with open("input.csv", newline="") as fin, \
         open("output.csv", "w", newline="") as fout:
        writer = csv.writer(fout)
        for record in csv.reader(fin):
            for i in (0, 1):  # the first two columns hold the ID numbers
                n = record[i]
                if n not in mapping:
                    mapping[n] = unused_numbers.pop()
                record[i] = mapping[n]
            writer.writerow(record)

    # Keep this file secret: it is the re-identification key.
    with open("mapping.csv", "w", newline="") as f:
        csv.writer(f).writerows(mapping.items())
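Since the 2 TB presumably spans many CSV files, the same correspondence has to be applied consistently across all of them so that ID-matching still works afterward. A sketch of a second pass, assuming the lookup-table file written above and hypothetical file paths, that reloads the table and anonymizes further files:

```python
import csv

def load_mapping(path):
    """Load the ID -> de-identified-number table saved by the first pass."""
    with open(path, newline="") as f:
        return dict(csv.reader(f))  # each row is a (original, replacement) pair

def anonymize_file(in_path, out_path, mapping):
    """Rewrite one CSV, replacing the first two columns via the mapping."""
    with open(in_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for record in csv.reader(fin):
            record[0] = mapping[record[0]]
            record[1] = mapping[record[1]]
            writer.writerow(record)
```

Because each file is independent once the mapping is fixed, the per-file rewrites can be farmed out across the 32 cores (e.g., with `multiprocessing.Pool`).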
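To make the dictionary-attack point above concrete, here is a scaled-down demonstration. SHA-256 and the 4-digit ID space are illustrative assumptions; the same attack on 10 million 7-digit account numbers takes only seconds on modern hardware:

```python
import hashlib

def sha_hex(n):
    # The hypothetical "anonymization": publish the SHA-256 digest of the ID.
    return hashlib.sha256(str(n).encode()).hexdigest()

leaked = sha_hex(1234)  # one digest observed in the released data

# The adversary hashes every candidate ID and inverts the table.
rainbow = {sha_hex(n): n for n in range(10_000)}
recovered = rainbow[leaked]  # recovers the original ID, 1234
```

This is why the random lookup table is preferable: without the mapping file there is nothing to enumerate.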