Towards Scalable Dataframe Systems

Paper title

Towards Scalable Dataframe Systems

Authors

Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya Parameswaran

Abstract

Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper, we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building MODIN, a scaled-up implementation of the most widely used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.
