Paper Title


Mitigating Bias in Set Selection with Noisy Protected Attributes

Authors

Anay Mehrotra and L. Elisa Celis

Abstract


Subset selection algorithms are ubiquitous in AI-driven applications, including online recruiting portals and image search engines, so it is imperative that these tools are not discriminatory on the basis of protected attributes such as gender or race. Currently, fair subset selection algorithms assume that the protected attributes are known as part of the dataset. However, protected attributes may be noisy due to errors during data collection or because they are imputed (as is often the case in real-world settings). While a wide body of work addresses the effect of noise on the performance of machine learning algorithms, its effect on fairness remains largely unexamined. We find that in the presence of noisy protected attributes, attempting to increase fairness without accounting for the noise can, in fact, decrease the fairness of the result! Towards addressing this, we consider an existing noise model in which there is probabilistic information about the protected attributes (e.g., [58, 34, 20, 46]), and ask: is fair selection possible under noisy conditions? We formulate a "denoised" selection problem that works for a large class of fairness metrics; given the desired fairness goal, the solution to the denoised problem violates the goal by at most a small multiplicative amount with high probability. Although this denoised problem turns out to be NP-hard, we give a linear-programming-based approximation algorithm for it. We evaluate this approach on both synthetic and real-world datasets. Our empirical results show that this approach can produce subsets that significantly improve the fairness metrics despite the presence of noisy protected attributes, and, compared to prior noise-oblivious approaches, has better Pareto trade-offs between utility and fairness.
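The abstract's idea of constraining a selection using only probabilistic attribute information can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's actual algorithm: it maximizes utility over an LP relaxation while bounding the *expected* number of selected items from a protected group, where p[i] is the assumed probability that item i belongs to the group, and then rounds the fractional solution by taking the k largest values.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical illustration (not the paper's exact formulation): choose k of n
# items to maximize total utility, subject to a fairness constraint on the
# expected number of selected protected-group members, using probabilistic
# attribute information p[i] = P(item i is in the protected group).
rng = np.random.default_rng(0)
n, k = 20, 5
utility = rng.uniform(size=n)   # item utilities
p = rng.uniform(size=n)         # assumed group-membership probabilities

# Fairness goal (assumed for the sketch): expected group count in [0.3k, 0.7k].
lo, hi = 0.3 * k, 0.7 * k

# LP relaxation: maximize utility @ x  s.t.  sum(x) = k,
#   lo <= p @ x <= hi,  0 <= x <= 1.  (linprog minimizes, so negate utility.)
res = linprog(
    c=-utility,
    A_ub=np.vstack([p, -p]),
    b_ub=np.array([hi, -lo]),
    A_eq=np.ones((1, n)),
    b_eq=np.array([k]),
    bounds=[(0, 1)] * n,
    method="highs",
)

# Naive rounding: keep the k items with the largest fractional values.
chosen = np.argsort(-res.x)[:k]
```

A real denoised-selection algorithm would use a more careful rounding scheme with a provable multiplicative violation bound; the sketch only shows how probabilistic attributes enter the constraints in expectation rather than as known labels.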
