Paper Title

Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

Authors

Tianyu Li, Matthew Butrovich, Amadou Ngom, Wan Shen Lim, Wes McKinney, Andrew Pavlo

Abstract

The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. We aim to reduce or even eliminate this process by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the DB-X DBMS. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.
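
The hot-to-cold idea in the abstract can be illustrated with a small sketch. It is not the paper's storage engine. It only assumes that a "hot" block keeps each variable-length value in its own slot so updates are cheap, and that a "cold" block uses the offsets-plus-contiguous-bytes layout that Arrow defines for variable-length binary columns. The names `HotBlock`, `freeze`, and `read` are hypothetical and chosen for illustration.

```python
from array import array
from typing import List, Optional, Tuple


class HotBlock:
    """Hypothetical write-friendly block: one slot per value, updated in place."""

    def __init__(self) -> None:
        self.slots: List[Optional[bytes]] = []

    def insert(self, value: bytes) -> int:
        self.slots.append(value)
        return len(self.slots) - 1

    def update(self, slot: int, value: bytes) -> None:
        # A value of a different length simply replaces the old one;
        # no neighbouring data has to move.
        self.slots[slot] = value

    def delete(self, slot: int) -> None:
        # Leave a gap; it is squeezed out when the block is frozen.
        self.slots[slot] = None


def freeze(block: HotBlock) -> Tuple[array, bytes]:
    """Compact live values into an offsets array plus one contiguous buffer."""
    offsets = array("i", [0])
    data = bytearray()
    for value in block.slots:
        if value is None:
            continue  # deleted slot: dropped during compaction
        data.extend(value)
        offsets.append(len(data))
    return offsets, bytes(data)


def read(offsets: array, data: bytes, i: int) -> bytes:
    """Return the i-th value of a frozen block by slicing the shared buffer."""
    return data[offsets[i]:offsets[i + 1]]


block = HotBlock()
a = block.insert(b"alice")
b = block.insert(b"bob")
block.insert(b"carol")
block.update(b, b"robert")   # cheap in-place update while the block is hot
block.delete(a)              # leaves a gap until the block is frozen
offsets, data = freeze(block)
```

After `freeze`, a reader needs only the two buffers. Layouts of this kind can be handed to analytical tools without per-value serialization, which is the property the abstract relies on for fast exports.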
