Paper title
MULTEXT-East
Paper authors
Paper abstract
This case study presents the MULTEXT-East language resources, a multilingual dataset for language engineering research focused on the morphosyntactic level of linguistic description. The dataset includes EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, George Orwell's novel "1984", is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, following the Text Encoding Initiative Guidelines (TEI P5), and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is extensively documented and freely available for research purposes. The case study gives a history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work, and draws some conclusions.