Implementation technique for differential privacy

Asked 19/11, 2016 at 15:36 Answered 20/10, 2020 at 10:58

python database statistics theory privacy

I am currently running an experiment on a dataset using differential privacy concepts. I am trying to implement one of the mechanisms of differential privacy, namely the Laplace mechanism, using a sample dataset from the UCI Machine Learning Repository and the Python programming language.
Let's assume that we have a simple counting query where we want to know the number of people who earn '<=50K', grouped by their 'occupation':

SELECT 
   adult.occupation, COUNT(adult.salary_group) As NumofPeople 
FROM 
   adult
WHERE 
   adult.salary_group = '<=50K'
GROUP BY 
   adult.occupation, adult.salary_group;

and this is the Laplace function I am trying to use

import numpy as np

def laplaceMechanism(x, epsilon):
    # Add noise drawn from Laplace(0, 1/epsilon) to the true answer x.
    x += np.random.laplace(0, 1.0/epsilon, 1)[0]
    return x

So my question is: how could I apply this function to the data I got, taking epsilon=2? I know that the Laplace mechanism works by adding random noise drawn from the Laplace distribution to the true answer we get from the query. A bit of insight would be appreciated...

Sherrer answered 19/11, 2016 at 15:36 Comment(5)

Not clear what you are asking. You could iterate over the query results and apply the function to each row. You could put the query results in a Pandas DataFrame then apply the function - the DataFrame will make it easy to work with your data - I recommend watching the video from the link. It does depend, somewhat, on the type of data structure you put the query result in. – Blizzard 19/11, 2016 at 16:12

Have you worked your way through The Python Tutorial in the docs? You might get some ideas. – Blizzard 19/11, 2016 at 16:18

@Blizzard If it is just a matter of iterating through the query results, I think that's manageable, and yes, I have tried to do the tutorial using a Pandas DataFrame, but thanks for the recommendation – Sherrer 19/11, 2016 at 16:23

If you come up with a working solution and would like a critique/feedback you could post it over at CodeReview, or if there are some sticky points in your solution, ask here and try to be specific. If you haven't already, you may want to peruse stackoverflow.com/help/asking , specifically stackoverflow.com/help/mcve. And it helps to state your intentions/goals - sometimes there is an altogether better way to accomplish something than what you tried. – Blizzard 19/11, 2016 at 16:37

You should use 1/epsilon, not epsilon. As epsilon gets to infinity, the laplace noise needs to go to zero. I've edited it. – Littman 21/9, 2019 at 16:5
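To illustrate the comment above about the scale, here is a minimal sketch (the helper name `laplace_noise_std`, the sample size, and the seed are made up for illustration) that draws Laplace noise with scale 1/epsilon and shows the spread shrinking as epsilon grows:

```python
import numpy as np

def laplace_noise_std(epsilon, n_samples=100_000, seed=0):
    """Empirical standard deviation of Laplace noise with scale 1/epsilon."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=n_samples)
    return noise.std()

for eps in (0.1, 1.0, 10.0):
    # The theoretical standard deviation of Laplace(0, b) is sqrt(2) * b,
    # so with b = 1/epsilon it falls as epsilon rises.
    print(eps, laplace_noise_std(eps), np.sqrt(2) / eps)
```

A larger epsilon means a weaker privacy guarantee, so less noise is needed; using epsilon itself as the scale would get this backwards.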

Assuming you have already loaded the CSV from the link into a database to run the SQL query, you can apply your Laplace function by first loading the results of the query into a pandas DataFrame using pandas.read_sql():

import pandas as pd

query =  '''SELECT 
   adult.occupation, COUNT(adult.salary_group) As NumofPeople 
FROM 
   adult
WHERE 
   adult.salary_group = '<=50K'
GROUP BY 
   adult.occupation, adult.salary_group;'''

df = pd.read_sql(query, '<database-connection-string>')

Then you can apply your function with pandas.Series.apply(), using args to pass in your epsilon:

df['NumofPeople'] = df['NumofPeople'].apply(laplaceMechanism, args=(2,))

The above replaces the NumofPeople column with the noisy values. Alternatively, you could keep the new series separate, attach it to the dataframe as a new column with a different name, or copy the dataframe first to keep the original around too.
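As a small sketch of the "new column" option, the following uses a hand-made DataFrame in place of the real query result. The counts, the function name `laplace_mechanism`, and the column name `NoisyNumofPeople` are all made up for illustration, and a sensitivity of 1 is assumed:

```python
import numpy as np
import pandas as pd

def laplace_mechanism(x, epsilon):
    # Add Laplace noise with scale 1/epsilon (assumes sensitivity 1).
    return x + np.random.laplace(0, 1.0 / epsilon)

# Hypothetical counts standing in for the result of the SQL query.
df = pd.DataFrame({
    "occupation": ["Sales", "Tech-support", "Craft-repair"],
    "NumofPeople": [120, 45, 98],
})

# Store the noisy counts in a new column so the true counts stay untouched.
df["NoisyNumofPeople"] = df["NumofPeople"].apply(laplace_mechanism, args=(2,))
print(df)
```

Only the noisy column should ever be released; the true counts are kept here purely for comparison.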

Aport answered 20/9, 2017 at 2:32 Comment(1)

Note that if you don't actually have it in a database, you can use pandas.read_csv and pandas's groupby and count functions to do everything in pandas without having to involve SQL and/or databases. – Aport 20/9, 2017 at 2:33
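A rough sketch of that pandas-only route might look like this. The rows below are made up and read from an in-memory string, and the header names `occupation` and `salary_group` are assumed to match the SQL table; a real CSV file may need its column names supplied explicitly.

```python
import io
import pandas as pd

# A few made-up rows standing in for the Adult CSV file.
csv_text = """occupation,salary_group
Sales,<=50K
Sales,>50K
Tech-support,<=50K
Sales,<=50K
Craft-repair,<=50K
"""
adult = pd.read_csv(io.StringIO(csv_text))

# Equivalent of: WHERE salary_group = '<=50K' GROUP BY occupation, with COUNT.
counts = (
    adult[adult["salary_group"] == "<=50K"]
    .groupby("occupation")["salary_group"]
    .count()
    .reset_index(name="NumofPeople")
)
print(counts)
```

The resulting `counts` frame has the same shape as the query result, so the noise function can be applied to its NumofPeople column in the same way.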

David Dean already answered the technical part of your question. I will add that naively adding Laplace noise to statistics like this might not work in certain cases, and will almost certainly lead to floating-point vulnerabilities.

It's important to think about the sensitivity of the mechanism. In your SQL example, if one person appears in multiple rows, they will be counted multiple times, and the noise won't be automatically scaled up to take this into account. It's OK with the Adult dataset because each row corresponds to a single person, but this shouldn't be assumed in general.
Adding Laplace noise using a standard library will make it vulnerable to floating-point attacks, which might ruin the privacy guarantees that would otherwise hold for a continuous Laplace distribution.
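As a rough sketch of the sensitivity point only, one option is to cap how many rows each person may contribute and scale the noise by that cap. The `person_id` column, the function names, the cap, and the data are all hypothetical. This sketch still samples floating-point Laplace noise, so it does nothing about the floating-point issue and is not a hardened implementation.

```python
import numpy as np
import pandas as pd

def bound_contributions(df, max_rows_per_person):
    # Keep at most `max_rows_per_person` rows for each person.
    return df.groupby("person_id").head(max_rows_per_person)

def noisy_count(df, epsilon, max_rows_per_person=1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    bounded = bound_contributions(df, max_rows_per_person)
    # One person can change the count by at most the cap, so the
    # sensitivity is the cap and the Laplace scale is cap / epsilon.
    scale = max_rows_per_person / epsilon
    return len(bounded) + rng.laplace(0.0, scale)

# Made-up data in which person 1 appears three times.
rows = pd.DataFrame({"person_id": [1, 1, 1, 2, 3]})
print(noisy_count(rows, epsilon=2, max_rows_per_person=2))
```

With a cap of 2, person 1 contributes only two of their three rows, and the noise scale doubles relative to the cap-of-1 case.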

Because of both of these points, it's a much better idea to use a library built specifically for differential privacy, for example Google's (disclaimer: I'm one of its authors).

Honig answered 20/10, 2020 at 10:58 Comment(0)
