Paper title
Qd-tree: Learning Data Layouts for Big Data Analytics
Authors
Abstract
Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range- or hash-partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for an important metric: the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore the performance, of most analytical queries. Further, these techniques are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for its construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2x of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.
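To make the "number of blocks accessed" metric concrete, the following minimal sketch compares two layouts of the same 16 records under a single range query. It is not the qd-tree construction algorithm from the paper. It assumes that each block keeps min/max metadata on one column and that a block is read whenever its range overlaps the query range. All names (`blocks_accessed`, `chunk_by_arrival`, `route_by_cuts`, the column `x`, and the cut points) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Record = Dict[str, int]
Block = List[Record]


def blocks_accessed(blocks: List[Block], column: str, lo: int, hi: int) -> int:
    """Count blocks whose [min, max] range on `column` overlaps the query range [lo, hi]."""
    count = 0
    for block in blocks:
        values = [record[column] for record in block]
        # A block can be skipped only if its value range cannot contain a match.
        if min(values) <= hi and max(values) >= lo:
            count += 1
    return count


def chunk_by_arrival(records: List[Record], block_size: int) -> List[Block]:
    """Group records into fixed-size blocks in arrival order."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def route_by_cuts(records: List[Record], column: str, cuts: List[int]) -> List[Block]:
    """Route each record to a block using sorted cut points on one column.

    This is a one-dimensional stand-in for predicate-based routing,
    assumed here purely for illustration.
    """
    blocks: List[Block] = [[] for _ in range(len(cuts) + 1)]
    for record in records:
        index = sum(record[column] >= cut for cut in cuts)
        blocks[index].append(record)
    return [block for block in blocks if block]


# 16 records whose `x` values arrive in a scrambled order (a permutation of 0..15).
records = [{"x": (i * 7) % 16} for i in range(16)]

arrival_blocks = chunk_by_arrival(records, block_size=4)
routed_blocks = route_by_cuts(records, column="x", cuts=[4, 8, 12])

# Query: 0 <= x <= 3
arrival_cost = blocks_accessed(arrival_blocks, "x", 0, 3)  # 3 of 4 blocks are read
routed_cost = blocks_accessed(routed_blocks, "x", 0, 3)    # 1 of 4 blocks is read
print(arrival_cost, routed_cost)
```

In this toy setup both layouts store the same records in four blocks of four, yet the arrival-order layout reads three blocks for the query while the predicate-routed layout reads one. That gap is the quantity the abstract describes as directly related to I/O cost.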