background
i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.
as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.
to help me with this question, you don't have to be familiar with replicate weights
-- just think of them as a few columns of strongly-correlated clustered data.
i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
request
i am looking for a technique that
- prevents the public use file users from easily deducing the shared geographic location off of the correlations between my replicate weights variables
- does not obliterate the correlations between my columns of data (the replicate weights variables)
- can be implemented on an R
data.frame
object without a major time investment
i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.
what i have tried
i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.
i can do simple things like add and subtract random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
thanks!!!!
sdcMicro
package – Nakia