Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

Paper title

Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

Authors

Md Rizwan Parvez, Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei Chang

Abstract

Prior studies on privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.
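
The retrieve-then-filter idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper ensembles pre-trained language model retrievers and uses learned filter models, whereas the two scoring functions, the `augment` helper, the `top_k` and `min_score` parameters, and the example segments below are all hypothetical stand-ins chosen so the example runs with only the Python standard library.

```python
def tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]


def overlap_score(query, segment):
    """Toy retriever 1: fraction of query tokens that appear in the segment."""
    q, s = set(tokens(query)), set(tokens(segment))
    return len(q & s) / len(q) if q else 0.0


def trigram_score(query, segment):
    """Toy retriever 2: Jaccard similarity over character trigrams."""
    def grams(text):
        t = " ".join(tokens(text))
        return {t[i:i + 3] for i in range(len(t) - 2)}
    a, b = grams(query), grams(segment)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def augment(query, unlabeled_segments, retrievers, top_k=2,
            min_score=0.3, filter_fn=overlap_score):
    """Union each retriever's top-k segments, then drop low-scoring ones.

    The threshold filter is an assumed stand-in for the paper's learned
    noise reduction filter models.
    """
    candidates = []
    for score in retrievers:
        ranked = sorted(unlabeled_segments,
                        key=lambda seg: score(query, seg), reverse=True)
        for seg in ranked[:top_k]:
            if seg not in candidates:
                candidates.append(seg)
    return [seg for seg in candidates if filter_fn(query, seg) >= min_score]


# Hypothetical unlabeled policy segments for one user query.
segments = [
    "We share location data with advertising partners.",
    "You can contact support by email.",
    "Location data is collected when the app is open.",
    "Our offices are closed on holidays.",
]
extra_positives = augment("do you share my location data", segments,
                          [overlap_score, trigram_score])
```

In this sketch the surviving segments would be added to the training set as additional positive examples for the query. Taking the union across retrievers is what adds diversity, and the filter step is what keeps noisy candidates out.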
