Paper Title
Unsupervised Tokenization Learning
Paper Authors
Paper Abstract
In this study, we find that the so-called "transition freedom" metric appears superior to statistical metrics such as mutual information and conditional probability for unsupervised tokenization, yielding F-measure scores ranging from 0.71 to 1.0 across the multilingual corpora explored. Different languages require different offshoots of that metric (such as derivative, variance, and "peak values") for successful tokenization. Larger training corpora do not necessarily yield better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. Depending on the language, the proposed unsupervised tokenization technique provides quality better than or comparable to that of lexicon-based techniques.
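The abstract does not define the metric, so the following Python sketch is only an illustration under stated assumptions. It assumes that "transition freedom" means the number of distinct characters observed immediately after a given character n-gram in the training text, and that a token boundary is placed where that count rises sharply (a crude stand-in for the "derivative" offshoot). The function names, the `n` and `threshold` parameters, and the splitting rule are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def transition_freedom(corpus, n=1):
    """Map each character n-gram to the number of distinct characters seen right after it.

    Assumption: this count is what the abstract calls "transition freedom".
    """
    followers = defaultdict(set)
    for text in corpus:
        for i in range(len(text) - n):
            followers[text[i:i + n]].add(text[i + n])
    return {gram: len(chars) for gram, chars in followers.items()}


def freedom_profile(text, freedom, n=1):
    """Look up the freedom of the n-gram ending at each position (0 if unseen)."""
    return [freedom.get(text[max(0, i - n + 1):i + 1], 0) for i in range(len(text))]


def split_on_rise(text, profile, threshold=1):
    """Hypothetical rule: cut after position i when freedom rises by at least `threshold`."""
    tokens, start = [], 0
    for i in range(1, len(text)):
        if profile[i] - profile[i - 1] >= threshold:
            tokens.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        tokens.append(text[start:])
    return tokens
```

The intuition behind this rule is that many different characters can follow the end of a word, while few can follow a position inside one, so a jump in the count hints at a boundary. Per the abstract, which variant of the metric works best depends on the language, so the threshold rule here should be read as one possible choice rather than the method evaluated in the study.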