Paper title

Text characterization based on recurrence networks

Authors

Souza, Bárbara C. e; Silva, Filipi N.; de Arruda, Henrique F.; da Silva, Giovana D.; Costa, Luciano da F.; Amancio, Diego R.

Abstract

Several complex systems exhibit intricate characteristics at several scales of time and space. Multiscale characterizations are used in various applications, including better understanding diseases, characterizing transportation systems, and comparing cities. Texts are also characterized by a hierarchical structure that can be approached with multiscale concepts and methods, and the multiscale properties of texts are a subject worth further investigation. In addition, more effective approaches to text characterization and analysis can be obtained by emphasizing words with potentially more informational content. The present work aims to develop these possibilities while focusing on mesoscopic representations of networks. More specifically, we adopt an extension of the mesoscopic approach to represent text narratives, in which only the recurrent relationships among tagged parts of speech (subject, verb and direct object) are considered to establish connections among sequential pieces of text (e.g., paragraphs). The texts were then characterized with complementary scale-dependent methods: accessibility, symmetry and recurrence signatures. To evaluate the potential of these concepts and methods, we approached the problem of distinguishing between literary genres (fiction and non-fiction). A set of 300 books organized into the two genres was considered, and the books were compared using the aforementioned approaches. All the methods were capable of differentiating between the two genres to some extent. The accessibility and symmetry reflected the narrative asymmetries, while the recurrence signature provided a more direct indication of the non-sequential semantic connections taking place along the narrative.
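The idea of linking pieces of text through recurring words can be illustrated with a small sketch. This is not the authors' implementation: it assumes each paragraph has already been reduced to a set of tagged terms (subjects, verbs and direct objects) by some upstream tagger, and the function name `build_recurrence_edges`, the `min_shared` threshold and the toy paragraphs are all hypothetical. It only shows how shared terms could connect paragraphs, and how links between non-adjacent paragraphs can be separated from sequential ones.

```python
from itertools import combinations


def build_recurrence_edges(paragraph_terms, min_shared=1):
    """Link paragraphs i < j that share at least `min_shared` terms.

    `paragraph_terms` is a list of sets, one per paragraph, in narrative
    order. Returns a dict mapping (i, j) to the set of shared terms.
    Assumption: term extraction (subject, verb, direct object) happened
    upstream; this function only compares the resulting sets.
    """
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(paragraph_terms), 2):
        shared = a & b
        if len(shared) >= min_shared:
            edges[(i, j)] = shared
    return edges


# Hypothetical toy input: one set of already-extracted terms per paragraph.
paragraphs = [
    {"captain", "sail", "ship"},
    {"storm", "hit", "ship"},
    {"crew", "repair", "mast"},
    {"captain", "thank", "crew"},
]

edges = build_recurrence_edges(paragraphs)

# Links between paragraphs that are not adjacent in the narrative.
non_sequential = {pair: terms for pair, terms in edges.items()
                  if pair[1] - pair[0] > 1}
```

In this toy input, paragraphs 0 and 3 are connected only because "captain" recurs, which is the kind of non-sequential connection the abstract refers to. The accessibility, symmetry and recurrence-signature measurements themselves are not defined in the abstract and are not reproduced here.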
