Paper title
Richer Countries and Richer Representations
Paper authors
Paper abstract
We examine whether some countries are more richly represented in embedding space than others. We find that countries whose names occur with low frequency in training corpora are more likely to be tokenized into subwords, are less semantically distinct in embedding space, and are less likely to be correctly predicted: e.g., Ghana (the correct answer, and in-vocabulary) is not predicted for "The country producing the most cocoa is [MASK]." Although these performance discrepancies and representational harms are due to frequency, we find that frequency is highly correlated with a country's GDP, thus perpetuating historic power and wealth inequalities. We analyze the effectiveness of mitigation strategies, recommend that researchers report training word frequencies, and recommend future work for the community to define and design representational guarantees.