Paper Title

FreCDo: A Large Corpus for French Cross-Domain Dialect Identification

Authors

Mihaela Gaman, Adrian-Gabriel Chifu, William Domingues, Radu Tudor Ionescu

Abstract

We present a novel corpus for French dialect identification comprising 413,522 French text samples collected from public news websites in Belgium, Canada, France and Switzerland. To ensure an accurate estimation of the dialect identification performance of models, we designed the corpus to eliminate potential biases related to topic, writing style, and publication source. More precisely, the training, validation and test splits are collected from different news websites, while searching for different keywords (topics). This leads to a French cross-domain (FreCDo) dialect identification task. We conduct experiments with four competitive baselines: a fine-tuned CamemBERT model, an XGBoost model based on fine-tuned CamemBERT features, a Support Vector Machine (SVM) classifier based on fine-tuned CamemBERT features, and an SVM based on word n-grams. Aside from presenting quantitative results, we also analyze the most discriminative features learned by CamemBERT. Our corpus is available at https://github.com/MihaelaGaman/FreCDo.
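The cross-domain design described in the abstract (splits drawn from different news websites and different topic keywords) can be illustrated with a minimal sketch. This is not the authors' collection pipeline: the function names `cross_domain_split` and `overlap`, the sample fields, the site names, and the toy data are all hypothetical and chosen only to show the idea of assigning samples to splits by source and then checking that no source or topic leaks across splits.

```python
from collections import defaultdict


def cross_domain_split(samples, source_to_split):
    """Assign each sample to a split based solely on its source website.

    samples: iterable of dicts with keys 'text', 'dialect', 'source', 'topic'.
    source_to_split: dict mapping a source name to a split name.
    """
    splits = defaultdict(list)
    for sample in samples:
        splits[source_to_split[sample["source"]]].append(sample)
    return dict(splits)


def overlap(splits, key):
    """Return the values of `key` that occur in more than one split."""
    seen = defaultdict(set)
    for split_name, items in splits.items():
        for item in items:
            seen[item[key]].add(split_name)
    return {value for value, names in seen.items() if len(names) > 1}


# Toy data invented for illustration; sites and topics are placeholders.
samples = [
    {"text": "exemple 1", "dialect": "BE", "source": "site-a.example", "topic": "economy"},
    {"text": "exemple 2", "dialect": "CA", "source": "site-b.example", "topic": "health"},
    {"text": "exemple 3", "dialect": "FR", "source": "site-c.example", "topic": "sports"},
    {"text": "exemple 4", "dialect": "CH", "source": "site-d.example", "topic": "culture"},
]

# Hypothetical assignment of whole websites to splits.
source_to_split = {
    "site-a.example": "train",
    "site-b.example": "train",
    "site-c.example": "validation",
    "site-d.example": "test",
}

splits = cross_domain_split(samples, source_to_split)
leaked_sources = overlap(splits, "source")
leaked_topics = overlap(splits, "topic")
```

If both `leaked_sources` and `leaked_topics` come back empty, a model evaluated on the test split cannot rely on having seen the same publication or the same topic keyword during training, which is the kind of bias the corpus is designed to remove.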
