technique to obfuscate clustered data and preserve privacy in r

background

i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.

as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.

to help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.

i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
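to make the risk concrete, here is a minimal sketch on made-up data. it assumes delete-one-cluster jackknife replicate weights, which is only one of several ways replicate weights get built and is not necessarily how mine were; the names `cluster`, `base_weight`, and `rep_weights` are all hypothetical. the attacker never sees `cluster`, yet grouping respondents by the pattern of their replicate-to-base weight ratios recovers every shared location.

```r
set.seed(1)

# confidential: 20 respondents spread over 5 geographic locations (made-up data)
n_clusters  <- 5
cluster     <- rep(seq_len(n_clusters), each = 4)
base_weight <- runif(length(cluster), 50, 150)

# assumed design: delete-one-cluster jackknife.
# replicate k zeroes out cluster k and scales everyone else by n / (n - 1)
rep_weights <- sapply(seq_len(n_clusters), function(k) {
  ifelse(cluster == k, 0, base_weight * n_clusters / (n_clusters - 1))
})
colnames(rep_weights) <- paste0("rw", seq_len(n_clusters))

# the attacker only has base_weight and rep_weights:
# respondents with the same ratio pattern must share a location
ratio_pattern <- apply(round(rep_weights / base_weight, 6), 1, paste, collapse = "|")
guessed <- as.integer(factor(ratio_pattern, levels = unique(ratio_pattern)))

# share of respondent pairs whose same/different-location status is recovered
same_true  <- outer(cluster, cluster, "==")
same_guess <- outer(guessed, guessed, "==")
recovered  <- mean(same_true == same_guess)
recovered
```

any obfuscation technique would need to push `recovered` below one while leaving the variance estimates usable.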

request

i am looking for a technique that

  • prevents the public use file users from easily deducing the shared geographic location off of the correlations between my replicate weights variables
  • does not obliterate the correlations between my columns of data (the replicate weights variables)
  • can be implemented on an R data.frame object without a major time investment

i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.

what i have tried

i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.

i can do simple things like add and subtract random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
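for reference, the simple approach i mean looks roughly like the sketch below. the data.frame `x`, the `repwgt` column prefix, and the `noise_share` value are all hypothetical placeholders, and this is the naive baseline rather than a vetted disclosure-control method.

```r
set.seed(42)

# hypothetical stand-in for a survey data.frame with three replicate weight columns
n <- 200
shared <- rnorm(n, 100, 15)
x <- data.frame(
  repwgt1 = shared + rnorm(n, 0, 2),
  repwgt2 = shared + rnorm(n, 0, 2),
  repwgt3 = shared + rnorm(n, 0, 2)
)
rw_cols <- grep("^repwgt", names(x), value = TRUE)

# add independent zero-mean normal noise to each replicate weight column,
# with a standard deviation set to a fraction of that column's own sd
noise_share <- 0.10   # arbitrary tuning knob, not a recommended value
y <- x
y[rw_cols] <- lapply(x[rw_cols], function(col) {
  col + rnorm(length(col), mean = 0, sd = noise_share * sd(col))
})

# how far did the correlations between the columns move?
max_cor_shift <- max(abs(cor(x[rw_cols]) - cor(y[rw_cols])))
max_cor_shift
```

raising `noise_share` makes the guessing game harder but also erodes the correlations, and on real weights the noise could push small values below zero, which is part of why i'd rather lean on an established method.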

thanks!!!!

Weinberger answered 13/6, 2014 at 9:59 Comment(4)
Try looking at the sdcMicro package. - Nakia
You cannot. More than one data scientist/software guru has shown it's easy to extract personal identification from allegedly anonymized big data clumps. Your choice is either, as you noted, to leave a path for someone to reconstruct the geodata, or to remove the geodata entirely and do your analysis based on some other factor. - Oraliaoralie
the united states census bureau regularly does what i am describing, despite their own strict confidentiality rules. let's lower the bar and say, "if it's good enough for census, it's good enough for me." i am hereby defining a new term: WWCD? thanks - Weinberger
thanks @Nakia i had never heard of that before! i spent some time trying to answer my own question with that toolkit. :) - Weinberger

i have written this nine-step tutorial to walk through the process in an attempt to answer my own question. i am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and also other ideas. thanks!

http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html

Weinberger answered 15/6, 2014 at 10:38 Comment(2)
the link is dead :-( - Bubb
whoops, apologies.. blog post: usgsd.blogspot.com/2014/09/… and code: github.com/ajdamico/asdfree/tree/archive/Confidentiality - Weinberger
