Paper Title
Information-Theoretic Foundations of DNA Data Storage
Paper Authors
Paper Abstract
Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for society's growing need for data storage. Although DNA molecules in living organisms can consist of millions of nucleotides, technological constraints mean that in practice data is stored on many short DNA molecules, which are preserved in a DNA pool and cannot be spatially ordered. Moreover, imperfections in synthesis, sequencing, and handling, as well as DNA decay during storage, introduce random noise into the system, making it challenging to store and retrieve information in DNA reliably. This unique setup raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences? The goal of this monograph is to address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, we propose a probabilistic channel model that captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise; and (3) the data is read by randomly sampling from the DNA pool. Our goal is to investigate the impact of each of these key aspects on the capacity of the DNA storage system. Rather than focusing on coding-theoretic considerations and computationally efficient encoding and decoding, we aim to build an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments.
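The three aspects of the channel model listed in the abstract can be illustrated with a small simulation. This is only a minimal sketch under assumptions the abstract does not state: noise is modeled as independent nucleotide substitutions, reads are drawn uniformly with replacement, and the function name `noisy_shuffling_sampling_channel` and its parameters `num_reads` and `sub_prob` are hypothetical, chosen for illustration.

```python
import random

ALPHABET = "ACGT"

def noisy_shuffling_sampling_channel(molecules, num_reads, sub_prob, rng):
    """Simulate reading from an unordered DNA pool (illustrative sketch).

    molecules: list of strings over ACGT (the written data)
    num_reads: number of reads drawn (assumed uniform, with replacement)
    sub_prob:  per-nucleotide substitution probability (assumed noise model)
    rng:       a random.Random instance, passed in for reproducibility
    """
    reads = []
    for _ in range(num_reads):
        # (3) Random sampling: the reader cannot choose which molecule it gets.
        molecule = rng.choice(molecules)
        # (2) Noise: each nucleotide is independently replaced by a
        # different nucleotide with probability sub_prob.
        noisy = "".join(
            rng.choice([b for b in ALPHABET if b != base])
            if rng.random() < sub_prob else base
            for base in molecule
        )
        reads.append(noisy)
    # (1) No ordering: the output order carries no information about
    # the order in which the molecules were written.
    rng.shuffle(reads)
    return reads

rng = random.Random(0)
pool = ["ACGTAC", "TTGACA", "GGCATC"]
reads = noisy_shuffling_sampling_channel(pool, num_reads=5, sub_prob=0.1, rng=rng)
print(reads)
```

Because sampling is with replacement, some molecules may be read several times and others not at all, which is one reason the number of reads matters for how much of the stored data can be recovered.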