Paper Title
Using Full-Text Content to Characterize and Identify Best Seller Books
Paper Authors
Paper Abstract
Artistic works can be studied from several perspectives, one example being their reception among readers over time. In the present work, we approach this topic from the standpoint of literary works, specifically assessing the task of predicting whether a book will become a best seller. Unlike previous approaches, we focused on the full content of books and considered both visualization and classification tasks. We employed visualization for a preliminary exploration of the structure and properties of the data, using SemAxis and linear discriminant analysis. Then, to obtain quantitative and more objective results, we employed various classifiers. These approaches were applied to a dataset containing (i) books published from 1895 to 1924 and consecrated as best sellers by the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period but not mentioned in those lists. Our comparison of methods revealed that the best result, obtained by combining a bag-of-words representation with a logistic regression classifier, was an average accuracy of 0.75 for both leave-one-out and 10-fold cross-validation. This outcome suggests that it is infeasible to predict the success of books with high accuracy using only the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.
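To make the evaluation protocol named in the abstract concrete, the following is a minimal sketch of a bag-of-words representation fed to a logistic regression classifier and scored with 10-fold and leave-one-out cross-validation. It is not the authors' implementation. The whitespace tokenization, the relative-frequency features, the gradient-descent training loop, and every name in the code (`features`, `train_logreg`, `cross_val_accuracy`, `toy_corpus`) are assumptions made for illustration, and the corpus is synthetic, so the printed accuracies say nothing about real books.

```python
import math
import random
from collections import Counter


def features(text):
    """Bag-of-words: relative frequency of each whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(model, x):
    """Linear score: bias plus weighted sum of the features present in x."""
    weights, bias = model
    return bias + sum(weights.get(w, 0.0) * v for w, v in x.items())


def train_logreg(X, y, epochs=100, lr=0.5):
    """Fit logistic regression by stochastic gradient descent (assumed settings)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = sigmoid(score((weights, bias), x)) - label
            bias -= lr * err
            for w, v in x.items():
                weights[w] = weights.get(w, 0.0) - lr * err * v
    return weights, bias


def predict(model, x):
    return 1 if score(model, x) >= 0 else 0


def cross_val_accuracy(texts, labels, k, seed=0):
    """Mean per-fold accuracy of k-fold cross-validation (k == len(texts) is leave-one-out)."""
    n = len(texts)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of texts")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    X = [features(t) for t in texts]
    fold_scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = train_logreg([X[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, X[i]) == labels[i] for i in fold)
        fold_scores.append(hits / len(fold))
    return sum(fold_scores) / len(fold_scores)


def toy_corpus(n_per_class=10, length=40, seed=1):
    """Synthetic stand-in for labeled books (1 = best seller, 0 = not); purely illustrative."""
    rng = random.Random(seed)
    shared = ["the", "and", "of", "house", "day"]
    pools = {1: shared + ["love", "heart"], 0: shared + ["engine", "steel"]}
    texts, labels = [], []
    for label, pool in pools.items():
        for _ in range(n_per_class):
            texts.append(" ".join(rng.choice(pool) for _ in range(length)))
            labels.append(label)
    return texts, labels


TEXTS, LABELS = toy_corpus()
tenfold_acc = cross_val_accuracy(TEXTS, LABELS, k=10)
loo_acc = cross_val_accuracy(TEXTS, LABELS, k=len(TEXTS))
print(f"10-fold accuracy: {tenfold_acc:.2f}")
print(f"leave-one-out accuracy: {loo_acc:.2f}")
```

Features are computed per text and stored sparsely, so a word that appears only in a held-out book simply carries zero weight and no vocabulary information leaks from the test fold into training. In practice an established machine learning library would normally replace the hand-written training loop.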