Benchmark Performance of Machine And Deep Learning Based Methodologies for Urdu Text Document Classification

Paper Title

Benchmark Performance of Machine And Deep Learning Based Methodologies for Urdu Text Document Classification

Authors

Asim, Muhammad Nabeel, Ghani, Muhammad Usman, Ibrahim, Muhammad Ali, Ahmad, Sheraz, Mahmood, Waqar, Dengel, Andreas

Abstract

To establish benchmark performance for Urdu text document classification, this paper makes five contributions. First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates how embedding 10 filter-based feature selection algorithms, which have been widely used for other languages, affects the performance of traditional machine learning based Urdu text document classification methodologies. Third, for the very first time, it assesses the performance of various deep learning based methodologies for Urdu text document classification. In this regard, for experimentation, we adapt 10 deep learning classification methodologies which have produced the best performance figures for English text classification. Fourth, it investigates the performance impact of transfer learning by utilizing the Bidirectional Encoder Representations from Transformers approach for the Urdu language. Fifth, it evaluates the integrity of a hybrid approach which combines traditional machine learning based feature engineering and deep learning based automated feature engineering. Experimental results show that the feature selection approach named Normalised Difference Measure, combined with a Support Vector Machine, surpasses state-of-the-art performance on two closed source benchmark datasets, CLE Urdu Digest 1000k and CLE Urdu Digest 1Million, by significant margins of 32% and 13%, respectively. Across all three datasets, Normalised Difference Measure outperforms the other filter-based feature selection algorithms, as it significantly improves the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and the presented dataset are available in a GitHub repository.
