Paper Title

How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods

Author

Decter-Frain, Ari

Abstract

Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, machine learning consistently outperforms BISG at individual classification of race/ethnicity. Second, BISG and machine learning methods exhibit divergent biases for estimating regional racial composition. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.
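
For readers unfamiliar with BISG, the sketch below illustrates the Bayesian update the method is generally understood to rest on: a surname-based prior over race/ethnicity is multiplied by the probability of living in a given geographic unit for each group, then normalized. It is a minimal illustration and not the paper's implementation. The function name `bisg_posterior`, the category labels, and all probabilities are hypothetical values chosen for the example; they are not Census figures.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    Assumes surname and location are conditionally independent given race,
    so the posterior is proportional to P(race | surname) * P(geo | race).
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no race category has non-zero probability")
    return {race: value / total for race, value in unnormalized.items()}


# Toy, made-up probabilities for illustration only (assumptions, not real data).
p_race_given_surname = {"white": 0.60, "black": 0.30, "hispanic": 0.05, "other": 0.05}
p_geo_given_race = {"white": 0.001, "black": 0.004, "hispanic": 0.002, "other": 0.001}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)

# Individual classification: pick the most probable category.
predicted = max(posterior, key=posterior.get)
```

In this toy case the geography term outweighs the surname prior, which shows how the two sources of evidence interact. Taking the most probable category corresponds to the individual classification task the abstract describes; estimating regional composition would instead aggregate posteriors across many individuals.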
