Paper Title
A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From Texts
Authors
Abstract
Mining latent intentions from large volumes of natural language inputs is a key step in helping data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service. To aid data analysts in this task, we present Verint Intent Manager (VIM), an analysis platform that combines unsupervised and semi-supervised approaches to help analysts quickly surface and organize relevant user intentions from conversational texts. For the initial exploration of data, we use a novel unsupervised and semi-supervised pipeline that integrates the fine-tuning of high-performing language models, a distributed k-NN graph-building method, and community detection techniques for mining the intentions and topics from texts. The fine-tuning step is necessary because pre-trained language models cannot encode texts so as to efficiently surface particular clustering structures when the target texts come from an unseen domain or the clustering task is not topic detection. For flexibility, we deploy two clustering approaches: one in which the number of clusters must be specified, and one in which the number of clusters is detected automatically with comparable clustering quality, but at the expense of additional computation time. We describe the application and its deployment, and demonstrate its performance using BERT on three text-mining tasks. Our experiments show that BERT begins to produce better task-aware representations when fine-tuned on a labeled subset as small as 0.5% of the task data. Clustering quality exceeds state-of-the-art results when BERT is fine-tuned with labeled subsets of only 2.5% of the task data. As deployed in the VIM application, this flexible clustering pipeline produces high-quality results, improves the performance of data analysts, and reduces the time it takes to surface intentions from customer service data, thereby shortening the time required to build and deploy IVAs in new domains.
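The core of the pipeline described above (encode texts, build a k-NN graph, then cluster it with community detection so the number of clusters emerges automatically) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn for the neighbor search and NetworkX's greedy modularity algorithm as a stand-in community detector, and it takes precomputed embeddings as input (in the paper these would come from a fine-tuned BERT model).

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors
from networkx.algorithms.community import greedy_modularity_communities

def cluster_embeddings(embeddings, k=10):
    """Cluster text embeddings via a k-NN graph and community detection.

    The number of clusters is not specified up front; it is determined
    by the community structure of the graph.
    """
    n = len(embeddings)
    # Find each point's k nearest neighbors (k+1 because the nearest
    # neighbor of a point is the point itself).
    nn = NearestNeighbors(n_neighbors=min(k + 1, n), metric="cosine")
    nn.fit(embeddings)
    _, idx = nn.kneighbors(embeddings)

    # Build an undirected k-NN graph: one node per text, edges to neighbors.
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i, neighbors in enumerate(idx):
        for j in neighbors[1:]:  # skip self-loop
            g.add_edge(i, int(j))

    # Community detection assigns each node to a community; communities
    # become the clusters. (Any community detector could be swapped in.)
    communities = greedy_modularity_communities(g)
    labels = np.empty(n, dtype=int)
    for c, members in enumerate(communities):
        for m in members:
            labels[m] = c
    return labels

# Toy usage: two well-separated point clouds stand in for embeddings of
# two distinct intents; the pipeline should keep them in separate clusters.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.05, size=(30, 8)) + np.eye(8)[0] * 5
blob_b = rng.normal(0.0, 0.05, size=(30, 8)) + np.eye(8)[1] * 5
labels = cluster_embeddings(np.vstack([blob_a, blob_b]), k=5)
```

Swapping `greedy_modularity_communities` for a method such as Louvain or label propagation changes only the final step, which is what makes the graph-based formulation flexible.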