
> This can also provide a simple "upper bound" on how differentially private your model is.

This is a very interesting idea. Are you aware of any resources that go through the mechanics?



I think what they are getting at is the following, which isn't a proof or anything (if you are looking for rules to follow, the much simpler "your model isn't differentially private" is probably even more accurate):

Let X be a set of iid samples drawn from a distribution D, let M be a model trained with eps-DP on the set X, and let x' be a fresh sample from D.

Let's think about the random variable M(x'), which represents your test loss (or whatever metric you care about). Differential privacy says that the density of this random variable is pointwise within a factor of exp(eps) of the density of M'(x'), where M' is the random model resulting from training on X \cup { x' }. But M'(x') — a random x', a random X, and a model trained on both — is distributed just like a per-example training loss.

You can push this density bound through the expectations to show that your expected test and training losses differ by at most a factor of exp(eps), but I'm not aware of a pointwise test for a specific model (and such a test may not make sense, since DP is a property of distributions over models rather than of specific models).
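
To spell that step out (a sketch in my own notation, not the parent's: p_M and p_{M'} are the densities of the two trained models, and f is any non-negative function of a model, e.g. f(m) = loss of m on x'):

    E_M[f(M)] = \int f(m)\, p_M(m)\, dm
              \le \int f(m)\, e^{\varepsilon} p_{M'}(m)\, dm    % pointwise eps-DP density bound
              = e^{\varepsilon}\, E_{M'}[f(M')].

Taking f(m) to be the loss of m on x' and averaging over X and x' bounds the expected test loss by exp(eps) times the expected per-example training loss; swapping the roles of M and M' gives the reverse direction.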

Edit: you can read a bit more about a special case of this at https://windowsontheory.org/2014/02/04/differential-privacy-...


Thanks, using DP to bound deviations of arbitrary functions of a model is a neat idea.

I wonder if it makes sense to go the other way, from generalization error to an estimate of something similar in spirit to (but not as strong as) differential privacy, as the top-level comment suggested?

For example, say I want to empirically argue that a particular function of my GMM, let's say the log-likelihood of x_i, is "private." To do so I form C iid train/test splits. For each split I estimate the density of the log-likelihood over train samples and over test samples, and then an upper bound on the ratio of those two densities. As a result I get C samples of epsilon, and I can use some kind of simple bound (Chebyshev?) on the probability of epsilon lying within an interval.
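
A minimal sketch of that procedure in Python (sklearn's GaussianMixture plus a KDE over per-sample log-likelihoods; the constants C, the number of components, the grid size, and the density floor are arbitrary choices of mine, and the resulting number is only an empirical summary, not a DP guarantee):

    import numpy as np
    from scipy.stats import gaussian_kde
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import train_test_split

    def empirical_epsilon(X, C=50, n_components=3, floor=1e-3, seed=0):
        rng = np.random.RandomState(seed)
        eps_samples = []
        for c in range(C):
            X_tr, X_te = train_test_split(
                X, test_size=0.5, random_state=rng.randint(2**31 - 1))
            gmm = GaussianMixture(n_components=n_components, random_state=c).fit(X_tr)
            ll_tr = gmm.score_samples(X_tr)  # per-sample log-likelihood, train
            ll_te = gmm.score_samples(X_te)  # per-sample log-likelihood, test
            # Estimate both densities on a common grid and bound their log-ratio,
            # ignoring the far tails where the KDE estimates are unreliable.
            grid = np.linspace(min(ll_tr.min(), ll_te.min()),
                               max(ll_tr.max(), ll_te.max()), 200)
            p_tr = gaussian_kde(ll_tr)(grid)
            p_te = gaussian_kde(ll_te)(grid)
            mask = (p_tr > floor) & (p_te > floor)
            ratio = np.maximum(p_tr[mask] / p_te[mask], p_te[mask] / p_tr[mask])
            eps_samples.append(np.log(ratio.max()))
        eps_samples = np.array(eps_samples)
        # Chebyshev-style interval from the C samples:
        # P(|eps - mean| >= k * std) <= 1/k^2, with sample mean/std as stand-ins.
        k = 3.0
        return eps_samples.mean(), k * eps_samples.std()

    # Usage on synthetic data:
    X = np.random.randn(2000, 2)
    mean_eps, half_width = empirical_epsilon(X)
    print(f"empirical epsilon ~ {mean_eps:.2f} +/- {half_width:.2f}")

The density ratio is only meaningful away from the tails, where the KDEs have essentially no mass, hence the floor; that choice alone can move the answer a lot, which is part of why this gives an empirical argument rather than a certificate.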

The idea is that we already have some "privacy" [1] from the data-sampling distribution. So we don't necessarily need to add noise to our algorithm. And it would be interesting to measure this privacy (at least for a particular function of the model) empirically.

[1] http://ieeexplore.ieee.org/abstract/document/6686180/



