Paper Title

Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder

Authors

Wang, Desheng; Liu, Jiawei; Qi, Xiang; Sun, Baolin; Zhang, Peng

Abstract

Regular expressions are important for many natural language processing tasks, especially when dealing with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm for the problem. Unlike methods that generate regular expressions at the character level, we first use a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm combines multiple objectives and is optimized through an evolutionary procedure that includes crossover and mutation operations. The fitness function takes into account the length of the generated regular expression, maximizing the matched characters and samples for positive training samples, and minimizing the matched characters and samples for negative training samples. In addition, to accelerate training, we apply exponential decay to the population size of the genetic algorithm. Our method and a strong baseline are tested on 13 kinds of challenging datasets. The results demonstrate the effectiveness of our method, which outperforms the baseline on 10 kinds of data and achieves nearly 50 percent improvement on average. With exponential decay, training is approximately 100 times faster than without it. In summary, our method is both effective and efficient, and can be implemented for industrial applications.
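
To make two of the ideas in the abstract concrete, here is a minimal sketch of BPE-style extraction of frequent items and of an exponentially decaying population size. It is not the authors' implementation: the function names, the merge-stopping rule, and the decay parameters (`initial`, `decay_rate`, `minimum`) are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def bpe_frequent_tokens(samples, num_merges):
    """Run up to num_merges BPE merges and return the merged tokens in order.

    Illustrative only: the stopping rule (stop when no pair occurs twice)
    is an assumption, not taken from the paper.
    """
    # Start from character-level sequences.
    seqs = [list(s) for s in samples]
    merged = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all samples.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is "frequent"
        merged.append(a + b)
        # Replace every occurrence of the pair (a, b) with the merged token.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merged


def population_size(generation, initial=1000, decay_rate=0.1, minimum=10):
    """Hypothetical schedule: initial * exp(-decay_rate * generation),
    floored at `minimum` so the population never vanishes."""
    return max(minimum, int(initial * math.exp(-decay_rate * generation)))


tokens = bpe_frequent_tokens(["abab", "abc"], num_merges=5)
sizes = [population_size(g) for g in range(0, 50, 10)]
```

In a genetic algorithm built along these lines, the merged tokens would serve as building blocks for candidate regular expressions in place of single characters, and `population_size(g)` would set how many candidates are evaluated in generation `g`.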
