Paper Title

Learning from Mixtures of Private and Public Populations

Paper Authors

Raef Bassily, Shay Moran, Anupama Nandi

Abstract

We initiate the study of a new model of supervised learning under privacy constraints. Imagine a medical study where a dataset is sampled from a population of both healthy and unhealthy individuals. Suppose healthy individuals have no privacy concerns (in such case, we call their data "public") while the unhealthy individuals desire stringent privacy protection for their data. In this example, the population (data distribution) is a mixture of private (unhealthy) and public (healthy) sub-populations that could be very different. Inspired by the above example, we consider a model in which the population $\mathcal{D}$ is a mixture of two sub-populations: a private sub-population $\mathcal{D}_{\sf priv}$ of private and sensitive data, and a public sub-population $\mathcal{D}_{\sf pub}$ of data with no privacy concerns. Each example drawn from $\mathcal{D}$ is assumed to contain a privacy-status bit that indicates whether the example is private or public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. Prior works in this context assumed a homogeneous population where private and public data arise from the same distribution, and in particular designed solutions which exploit this assumption. We demonstrate how to circumvent this assumption by considering, as a case study, the problem of learning linear classifiers in $\mathbb{R}^d$. We show that in the case where the privacy status is correlated with the target label (as in the above example), linear classifiers in $\mathbb{R}^d$ can be learned, in the agnostic as well as the realizable setting, with sample complexity which is comparable to that of the classical (non-private) PAC-learning. It is known that this task is impossible if all the data is considered private.
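To make the data model concrete, below is a minimal illustrative sketch (in Python) of drawing labeled examples together with a privacy-status bit from a mixture of a private and a public sub-population. The function name `sample_mixture`, the Gaussian sub-populations, and the parameter `priv_frac` are assumptions introduced purely for illustration; they are not the paper's construction or learning algorithm. As in the medical example above, the privacy status here coincides with the target label.

```python
import numpy as np

def sample_mixture(n, d=5, priv_frac=0.3, seed=0):
    """Draw n examples (x, y, is_private) from an illustrative mixture D.

    The private sub-population D_priv (e.g. "unhealthy", label y = 1) and the
    public sub-population D_pub (e.g. "healthy", label y = 0) are modeled as
    two Gaussians with different means, so the privacy-status bit is perfectly
    correlated with the target label, as in the medical-study example.
    """
    rng = np.random.default_rng(seed)
    is_private = rng.random(n) < priv_frac             # privacy-status bit of each example
    means = np.where(is_private[:, None], 1.0, -1.0)   # sub-population-dependent mean
    x = rng.normal(loc=means, scale=1.0, size=(n, d))  # feature vectors in R^d
    y = is_private.astype(int)                          # label correlated with privacy status
    return x, y, is_private

# Example usage: a learner would be required to satisfy differential privacy
# only with respect to the rows where is_private is True.
x, y, is_private = sample_mixture(1000)
print(x.shape, y.mean(), is_private.mean())
```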
