Tags: python, data-mining, data-security

Generalized Random Response for local differential privacy implementation


I have been tasked with implementing a local (non-interactive) differential privacy mechanism. I am working with a large database of census data. The only sensitive attribute is "Number of children" which is a numerical value ranging from 0 to 13.

I decided to go with the Generalized Random Response mechanism as it seems like the most intuitive method. This mechanism is described here and presented here.

After loading each value into an array (ignoring the other attributes for now), I perform the perturbation as follows.

import math
import random

d = 14    # values may range from 0 to 13
eps = 1   # epsilon level of privacy

# probability of reporting the true value
p = math.exp(eps) / (math.exp(eps) + d - 1)
# probability assigned to each of the other d - 1 values
q = 1 / (math.exp(eps) + d - 1)

p_dataset = []

for row in dataset:
    coin = random.random()
    if coin <= p:
        p_dataset.append(row)
    else:
        p_dataset.append(random.randint(0, 13))

Unless I have misinterpreted the definition, I believe this will guarantee epsilon differential privacy on p_dataset.
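As a sanity check on the parameters (this is my own check, not part of the mechanism), the worst-case likelihood ratio between any two inputs should be p/q, which for this design works out to exactly e^eps, assuming the false answer is drawn only from the other d − 1 values:

```python
import math

d = 14
eps = 1

p = math.exp(eps) / (math.exp(eps) + d - 1)
q = 1 / (math.exp(eps) + d - 1)

# Worst-case likelihood ratio of the mechanism; should equal e^eps
# for the intended eps-differential-privacy guarantee.
print(p / q)
print(math.exp(eps))

# The d outcome probabilities should also sum to 1: one value gets p,
# the remaining d - 1 values get q each.
print(p + (d - 1) * q)
```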

However, I am having difficulty understanding how the aggregator must interpret this dataset. Following the presentation above, I attempted to implement a method for estimating the number of individuals who answered a particular value.

n = len(p_dataset)  # total number of reports
v = 0               # we are estimating the number of individuals in the dataset who answered 0
nv = 0              # number of users in the perturbed dataset who reported v

for row in p_dataset:
    if row == v:
        nv += 1

Iv = nv * p + (n - nv) * q
estimation = (Iv - (n * q)) / (p - q)

I do not know whether I have implemented the method correctly, as I do not completely understand what it is doing and cannot find a clear definition of it.

Regardless, I used this method to estimate the total number of individuals who answered each value in the dataset, with epsilon ranging from 1 to 14, and then compared the estimates to the actual values. The results are below (please excuse the formatting).

[Image: Results of estimation]

As you can see, the utility of the dataset suffers greatly for low values of epsilon. Additionally, when executed multiple times, there was relatively little deviation in estimations, even for small values of epsilon.

For example, when estimating the number of participants who answered 0 and using an epsilon of 1, all estimations seemed to be centered around 1600, with the largest distance between estimations being 100. Considering the actual value of this query is 5969, I am led to believe that I may have implemented something incorrectly.

Is this the expected behaviour of the Generalized Random Response mechanism, or have I made a mistake in my implementation?


Solution

  • I think that when returning a false answer, we cannot directly use p_dataset.append(random.randint(0, 13)), because that range still contains the true answer; the false response should be drawn uniformly from the other d − 1 values.
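A minimal sketch of the corrected perturbation step under that reading (the names grr_perturb and grr_estimate are my own, hypothetical choices): the false report is drawn uniformly from the d − 1 values other than the true one, so each gets probability q. I have also included what I understand to be the standard unbiased frequency estimator for this mechanism, which uses the raw count directly rather than the Iv intermediate quantity.

```python
import math
import random

def grr_perturb(true_value, d=14, eps=1.0):
    """Report the true value with probability p; otherwise report a
    value drawn uniformly from the OTHER d - 1 values."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if random.random() <= p:
        return true_value
    # Draw from 0..d-2 and shift past the true value, so the true
    # value can never be returned on the "false" branch.
    fake = random.randint(0, d - 2)
    return fake if fake < true_value else fake + 1

def grr_estimate(reports, v, d=14, eps=1.0):
    """Unbiased estimate of how many users truly hold value v,
    given the perturbed reports."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1 / (math.exp(eps) + d - 1)
    nv = sum(1 for r in reports if r == v)
    # E[nv] = n_true * p + (n - n_true) * q; solving for n_true gives:
    return (nv - n * q) / (p - q)
```

With this pairing the estimate is unbiased, though for eps = 1 and d = 14 the per-value variance is still large, so substantial noise at low epsilon is expected behaviour rather than a bug.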