Paper Title
On Efficiently Partitioning a Topic in Apache Kafka
Paper Authors
Abstract
Apache Kafka addresses the general problem of delivering extremely high-volume event data to diverse consumers via a publish-subscribe messaging system. It uses partitions to scale a topic across many brokers, allowing producers to write data in parallel and consumers to read in parallel. Even though Apache Kafka provides some out-of-the-box optimizations, it does not strictly define how each topic shall be efficiently distributed into partitions. The well-formulated fine-tuning needed to improve the performance of an Apache Kafka cluster is still an open research problem. In this paper, we first model the Apache Kafka topic partitioning process for a given topic. Then, given the set of brokers, the constraints, and the application requirements on throughput, OS load, replication latency, and unavailability, we formulate the optimization problem of finding how many partitions are needed and show that it is computationally intractable, being an integer program. Furthermore, we propose two simple yet efficient heuristics to solve the problem: the first tries to minimize and the second to maximize the number of brokers used in the cluster. Finally, we evaluate their performance via large-scale simulations, using as benchmarks some Apache Kafka cluster configuration recommendations provided by Microsoft and Confluent. We demonstrate that, unlike the recommendations, the proposed heuristics respect the hard constraints on replication latency and perform better w.r.t. unavailability time and OS load, using the system resources in a more prudent way.
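The abstract uses partition-count recommendations from Microsoft and Confluent as benchmarks. As a point of reference only (not the paper's heuristics), the widely cited Confluent rule of thumb chooses the partition count as max(t/p, t/c), where t is the target throughput and p, c are the measured per-partition producer and consumer throughputs. The sketch below is a minimal illustration of that baseline; the function name and the throughput figures are hypothetical, and real values depend on the cluster, message size, and replication settings.

```python
import math


def confluent_rule_of_thumb(target_mb_s: float,
                            producer_mb_s_per_partition: float,
                            consumer_mb_s_per_partition: float) -> int:
    """Baseline partition count max(t/p, t/c), as popularized by Confluent.

    Illustrative sketch only: the per-partition throughputs (p, c) must be
    measured on the actual cluster, and this rule ignores the constraints
    on OS load, replication latency, and unavailability that the paper's
    heuristics take into account.
    """
    t = target_mb_s
    p = producer_mb_s_per_partition
    c = consumer_mb_s_per_partition
    return max(math.ceil(t / p), math.ceil(t / c))


# Hypothetical example: a topic that must sustain 500 MB/s, where a single
# partition was measured to handle 30 MB/s on the producer side and
# 60 MB/s on the consumer side.
print(confluent_rule_of_thumb(500, 30, 60))  # -> 17 partitions
```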