Paper Title
PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement
Paper Authors
Paper Abstract
Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely regarded as a promising framework for optimizing long-term user engagement in recommendation. Though promising, the application of RL heavily relies on well-designed rewards, and designing rewards related to long-term user engagement is quite difficult. To mitigate the problem, we propose a novel paradigm, recommender systems with human preferences, or Preference-based Recommender systems (PrefRec), which allows RL recommender systems to learn from preferences about users' historical behaviors rather than explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals, while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals to train the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression, and reward model pre-training to improve the performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods in all the tasks.
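The core step the abstract describes, training a reward function from pairwise human preferences over user behavior and then using it as the RL learning signal, is commonly implemented with a Bradley-Terry style objective. The sketch below illustrates only that reward-learning step under assumed details: the network architecture, tensor shapes, and names (RewardModel, preference_loss) are illustrative and not the paper's actual implementation, which additionally uses an extra value function, expectile regression, and reward model pre-training.

```python
# Minimal, illustrative sketch of preference-based reward learning
# (Bradley-Terry style), assuming pairwise labels over behavior segments.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores each step of a user-behavior segment; the segment return is the summed score."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, length, state_dim) -> one scalar return per segment
        return self.net(segment).sum(dim=(1, 2))

def preference_loss(model: RewardModel,
                    seg_a: torch.Tensor,
                    seg_b: torch.Tensor,
                    pref_b_over_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the preferred segment should receive the higher return.

    pref_b_over_a is 1.0 when the annotator preferred segment B, 0.0 for segment A."""
    returns = torch.stack([model(seg_a), model(seg_b)], dim=1)   # (batch, 2)
    log_probs = torch.log_softmax(returns, dim=1)
    target = pref_b_over_a.long()                                # index of the preferred segment
    return nn.functional.nll_loss(log_probs, target)

# Usage sketch: preference labels can come from crowdsourcing; the learned reward
# then replaces a hand-designed reward when training the recommendation policy with RL.
if __name__ == "__main__":
    model = RewardModel(state_dim=16)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    seg_a, seg_b = torch.randn(32, 10, 16), torch.randn(32, 10, 16)
    labels = torch.randint(0, 2, (32,)).float()
    loss = preference_loss(model, seg_a, seg_b, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```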