Title
Semantic Indexes for Machine Learning-based Queries over Unstructured Data
Authors
Abstract
Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, such as object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, and SUPG) train a query-specific proxy model to approximate large target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces embeddings for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores via embeddings without needing to train a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation and selection). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets, and three query types. We show that TASTI's indexes can be 10$\times$ less expensive to construct than generating annotations for current proxy-based methods, and can accelerate queries by up to 24$\times$.
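The idea that records with close embeddings have similar target labeler outputs can be illustrated with a small sketch. The abstract does not say how proxy scores are computed from embeddings, so everything here is an assumption made for illustration: the function name `proxy_scores`, the parameters `k` and `eps`, and the use of inverse-distance weighting over the `k` nearest annotated records.

```python
import math


def proxy_scores(embeddings, labeled, k=2, eps=1e-9):
    """Estimate a score for every record from its nearest annotated records.

    embeddings: list of equal-length float vectors, one per record.
    labeled:    dict mapping record index -> numeric target-labeler output,
                available only for a small annotated subset.
    k, eps:     hypothetical knobs (neighbor count, divide-by-zero guard).
    """
    if not labeled:
        raise ValueError("need at least one annotated record")
    scores = []
    for i, emb in enumerate(embeddings):
        if i in labeled:
            # Annotated records keep their known labeler output.
            scores.append(float(labeled[i]))
            continue
        # Distance from this record to every annotated record, closest first.
        nearest = sorted(
            (math.dist(emb, embeddings[j]), value)
            for j, value in labeled.items()
        )[:k]
        # Inverse-distance weighting: closer annotated records count more.
        weights = [1.0 / (d + eps) for d, _ in nearest]
        total = sum(w * v for w, (_, v) in zip(weights, nearest))
        scores.append(total / sum(weights))
    return scores


# Toy data: two tight groups of 2-D embeddings, one annotated record in each.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
toy_labels = {0: 0.0, 2: 3.0}  # e.g., object counts from an expensive labeler
print(proxy_scores(toy_embeddings, toy_labels, k=1))
```

In this toy setup each unannotated record takes a score close to that of the annotated record in its own group, and the resulting scores could be passed to a proxy-based query processing algorithm in place of the output of a per-query proxy model.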