Paper title
Yes-Yes-Yes: Proactive Data Collection for ACL Rolling Review and Beyond
Paper authors
Paper abstract
The shift towards publicly available text sources has enabled language processing at unprecedented scale, yet leaves under-serviced the domains where public and openly licensed data is scarce. Proactively collecting text data for research is a viable strategy to address this scarcity, but lacks systematic methodology taking into account the many ethical, legal and confidentiality-related aspects of data collection. Our work presents a case study on proactive data collection in peer review -- a challenging and under-resourced NLP domain. We outline ethical and legal desiderata for proactive data collection and introduce "Yes-Yes-Yes", the first donation-based peer reviewing data collection workflow that meets these requirements. We report on the implementation of Yes-Yes-Yes at ACL Rolling Review and empirically study the implications of proactive data collection for the dataset size and the biases induced by the donation behavior on the peer reviewing platform.