Paper title
Detecting ESG topics using domain-specific language models and data augmentation approaches
Paper authors
Abstract
Despite recent advances in deep learning-based language modelling, many natural language processing (NLP) tasks in the financial domain remain challenging due to the paucity of appropriately labelled data. Other issues that can limit task performance are differences in word distribution between the general corpora - typically used to pre-train language models - and financial corpora, which often exhibit specialized language and symbology. Here, we investigate two approaches that may help to mitigate these issues. Firstly, we experiment with further language model pre-training using large amounts of in-domain data from business and financial news. We then apply augmentation approaches to increase the size of our dataset for model fine-tuning. We report our findings on an Environmental, Social and Governance (ESG) controversies dataset and demonstrate that both approaches are beneficial to accuracy in classification tasks.
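The abstract does not say which augmentation techniques were used, so the following is only a minimal sketch of the general idea of enlarging a small labelled dataset before fine-tuning. It assumes simple token-level perturbations (random deletion and random swap), which are a stand-in and not necessarily the authors' method. The function names, parameters, example sentences, and labels are all hypothetical.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps randomly chosen pairs exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(examples, copies=2, p_delete=0.1, n_swaps=1, seed=0):
    """Expand (text, label) pairs with perturbed copies that keep the original label.

    Assumption: small token-level perturbations do not change the topic label.
    """
    rng = random.Random(seed)
    augmented = list(examples)  # the original examples are kept unchanged
    for text, label in examples:
        tokens = text.split()
        if not tokens:
            continue  # nothing to perturb in an empty text
        for _ in range(copies):
            perturbed = random_swap(random_delete(tokens, p_delete, rng), n_swaps, rng)
            augmented.append((" ".join(perturbed), label))
    return augmented


# Hypothetical labelled examples, invented for illustration only.
examples = [
    ("the company faces a lawsuit over emissions", "environment"),
    ("board members resigned after the audit", "governance"),
]
augmented = augment(examples, copies=3)
```

With `copies=3`, each original example contributes three perturbed variants, so the two examples above grow to eight. Whether such perturbations preserve the label is itself an assumption that should be checked on real data.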