Privacy-Preserving Synthetic Data Generation for Recommendation Systems

Paper Title

Privacy-Preserving Synthetic Data Generation for Recommendation Systems

Authors

Fan Liu, Zhiyong Cheng, Huilin Chen, Yinwei Wei, Liqiang Nie, Mohan Kankanhalli

Abstract

Recommendation systems make predictions chiefly based on users' historical interaction data (e.g., items previously clicked or purchased). There is a risk of privacy leakage when collecting the users' behavior data for building the recommendation model. However, existing privacy-preserving solutions are designed for tackling the privacy issue only during the model training and results collection phases. The problem of privacy leakage still exists when directly sharing the private user interaction data with organizations or releasing them to the public. To address this problem, in this paper, we present a User Privacy Controllable Synthetic Data Generation model (short for UPC-SDG), which generates synthetic interaction data for users based on their privacy preferences. The generation model aims to provide certain privacy guarantees while maximizing the utility of the generated synthetic data at both data level and item level. Specifically, at the data level, we design a selection module that selects those items that contribute less to a user's preferences from the user's interaction data. At the item level, a synthetic data generation module is proposed to generate a synthetic item corresponding to the selected item based on the user's preferences. Furthermore, we also present a privacy-utility trade-off strategy to balance the privacy and utility of the synthetic data. Extensive experiments and ablation studies have been conducted on three publicly accessible datasets to justify our method, demonstrating its effectiveness in generating synthetic data under users' privacy preferences.
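
The two-level idea in the abstract can be illustrated with a toy sketch. This is not the authors' UPC-SDG implementation: the function name `synthesize_history`, the parameters `privacy_ratio` and `trade_off`, the use of a dot product as a proxy for "contribution to the user's preferences", and the scoring rule for replacement items are all assumptions made for illustration, and the embeddings are taken as given.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def synthesize_history(
    user_vec: Vector,
    item_vecs: Dict[int, Vector],
    history: List[int],
    privacy_ratio: float,
    trade_off: float = 0.5,
) -> List[int]:
    """Replace a share of a user's items with stand-in items (toy sketch)."""
    # Data level (assumed proxy): the items with the lowest dot product
    # against the user vector are treated as contributing least to the
    # user's preferences. The count is rounded down.
    n_replace = int(privacy_ratio * len(history))
    by_contribution = sorted(history, key=lambda i: dot(user_vec, item_vecs[i]))
    to_replace = set(by_contribution[:n_replace])

    used = set(history)
    synthetic: List[int] = []
    for item in history:
        if item not in to_replace:
            synthetic.append(item)
            continue
        candidates = [c for c in item_vecs if c not in used]
        if not candidates:
            # Nothing left to substitute, so keep the original item.
            synthetic.append(item)
            continue
        # Item level (assumed scoring rule): reward fit to the user's
        # preferences (utility) and penalize similarity to the original
        # item (privacy), weighted by trade_off.
        best = max(
            candidates,
            key=lambda c: dot(user_vec, item_vecs[c])
            - trade_off * cosine(item_vecs[c], item_vecs[item]),
        )
        synthetic.append(best)
        used.add(best)
    return synthetic
```

In this sketch a larger `privacy_ratio` replaces more of the history, and a larger `trade_off` pushes replacements further from the original items at some cost in preference fit. The abstract does not specify how the paper's own trade-off strategy is formulated.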
