Paper Title
Dimensionality Reduction for Sentiment Classification: Evolving for the Most Prominent and Separable Features
Authors
Abstract
In sentiment classification, the enormous amount of textual data, its immense dimensionality, and its inherent noise make it extremely difficult for machine learning classifiers to extract high-level and complex abstractions. Dimensionality reduction techniques are needed to make the data less sparse and more statistically significant. In existing dimensionality reduction techniques, however, the number of components must be set manually, which results in the loss of the most prominent features and thus reduces classifier performance. Our prior techniques, Term Presence Count (TPC) and Term Presence Ratio (TPR), have proven effective because they reject the less separable features. However, the most prominent and separable features might still be removed from the initial feature set despite having higher distributions among positively and negatively tagged documents. To overcome this problem, we propose a new framework that consists of two dimensionality reduction techniques: Sentiment Term Presence Count (SentiTPC) and Sentiment Term Presence Ratio (SentiTPR). These techniques reject features by considering the term presence difference (SentiTPC) and the ratio of the distribution distinction (SentiTPR). Additionally, both methods analyze the total distribution information. Extensive experimental results show that the proposed framework reduces the feature dimension by a large margin and thereby significantly improves classification performance.
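The abstract does not give the exact SentiTPC and SentiTPR formulas, so the following is only a minimal sketch of the general idea of presence-based feature rejection, not the authors' method. It assumes binary labels (1 = positive, 0 = negative), measures the "term presence difference" as the absolute difference between the numbers of positive and negative documents containing a term, measures the "ratio" as that difference divided by the total presence, and uses the total presence as the "total distribution information". The function names and the thresholds `min_diff`, `min_ratio`, and `min_total` are hypothetical.

```python
from collections import Counter


def presence_counts(docs, labels):
    """Count, per term, how many positive and negative documents contain it."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        target = pos if label == 1 else neg
        target.update(set(tokens))  # presence per document, not term frequency
    return pos, neg


def select_by_count_difference(pos, neg, min_diff, min_total):
    """Assumed count-style rule: keep terms whose presence difference
    and total presence both reach the given (hypothetical) thresholds."""
    vocab = set(pos) | set(neg)
    return {
        t for t in vocab
        if abs(pos[t] - neg[t]) >= min_diff and pos[t] + neg[t] >= min_total
    }


def select_by_ratio(pos, neg, min_ratio, min_total):
    """Assumed ratio-style rule: keep terms whose presence difference,
    relative to total presence, reaches the (hypothetical) threshold."""
    vocab = set(pos) | set(neg)
    kept = set()
    for t in vocab:
        total = pos[t] + neg[t]  # always > 0 for terms in the vocabulary
        if total >= min_total and abs(pos[t] - neg[t]) / total >= min_ratio:
            kept.add(t)
    return kept


# Toy corpus: two positive and two negative tokenized documents.
docs = [
    ["good", "movie"],
    ["good", "plot"],
    ["bad", "movie"],
    ["bad", "plot", "good"],
]
labels = [1, 1, 0, 0]

pos, neg = presence_counts(docs, labels)
by_count = select_by_count_difference(pos, neg, min_diff=1, min_total=2)
by_ratio = select_by_ratio(pos, neg, min_ratio=0.5, min_total=2)
print(sorted(by_count), sorted(by_ratio))
```

In this toy corpus, "movie" and "plot" appear equally often in both classes, so both rules reject them. The count-style rule keeps "good" and "bad", while the stricter ratio-style rule keeps only "bad", which occurs exclusively in negative documents.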