Probing the Physical Limits of Reliable DNA Data Retrieval
Retrieving digital data stored at massive scale in synthetic DNA requires an understanding of the limits of that system. Our work physically stresses the upper limits of DNA data storage in several ways and shows we could achieve an upper bound of 17 EB/gram.
When reading scientific literature, it can seem as if the authors simply designed an experiment, performed it, and reported their findings in one easy, fluid process. While conceptually we all know this is rarely, if ever, how science works, we hardly ever see the more human side of the work behind the paper. In this Behind the Paper piece, I hope to add my voice to the growing number of researchers who find it important to share not only scientific findings, but also the meticulous, humbling process of science in a more accessible manner than the original publication.
To understand the work we at the Molecular Information Systems Lab (MISL) presented, we should first discuss what DNA data storage is and why researchers worldwide are working on it.
Say you have many digital files, anything from plain text to images or HD videos, and you want to store them for potentially many years before you need to access them. The current commercial state of the art for this kind of archival storage is magnetic tape. It looks like a large cassette tape, and rows upon rows of them are shelved like books in warehouses all over the world. In order to store all the world's digital data, these warehouses would take up an incredible amount of space.
As a more durable and compact alternative, several different labs are exploring avenues to store digital data in synthetic DNA. Among other advantages, DNA is so dense that some quick back-of-the-napkin calculations show we could easily store all the data contained in a Walmart-sized magnetic tape warehouse in a bundle of synthetic DNA the size of a few sugar cubes.
In the DNA data storage community, it's well known that DNA is dense, but the field is new enough that few experiments have pushed the physical limits of this process (if interested, see Erlich et al. 2017 and Tomek et al. 2019). Before investing significantly more time and effort into developing DNA data storage methods, we were motivated to investigate the practicality of such an effort. Specifically, we wanted to know if the extremely high data density enabled by DNA storage would cause problems while recovering data. At a large scale, we didn't think these problems would be small ("our DNA took a couple more minutes to amplify"); we thought they would be catastrophic ("random DNA sequences got amplified instead of the DNA sequences we wanted...we have lost the ability to practically recover the data").
It is important to note here that the method of DNA data storage we employ at MISL relies on random access, the ability to select and read arbitrary information of interest (Figure 1).
Figure 1. The MISL DNA storage process. Three digital files (green, blue, teal) are translated from binary bits to DNA nucleotides during encoding. The resulting DNA sequences are synthesized and then stored until we wish to retrieve a file. We then perform random access to extract the DNA corresponding to our file of interest, sequence the resulting DNA, and translate the data from nucleotides back into the original bits. Many thanks to Yuan-Jyue Chen, who made this figure.
When talking about random access in this context, I find it's helpful to imagine it's like trying to find one book in the Library of Congress after a terrible wind storm. Unfortunately for you, all the pages from all the books are scattered around the entire building. Luckily, on each page is the title and author of the book it belongs to. You can't walk out with all the pages in the Library of Congress, so you grab all the papers you come into contact with and stuff them into your backpack. Without random access, you, now playing the role of the DNA sequencer in this analogy, have to read the entirety of every single page you pull out of your backpack and then decide whether or not you want to keep it to reconstruct the book you're looking for. This takes up a lot of resources because you're reading so much junk. Alternatively, you can use random access to magically grab only the pages with the title and author you're interested in from your backpack and then read those, which is so much more efficient because you're only interested in reading one book. In MISL's protocol, this random access is done with PCR, so it's really more like making thousands of copies of only the pages with the title and author you're interested in so that all the other junk pages you grabbed only make up a tiny fraction of the pages you now have.
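To put rough numbers on the "thousands of copies" idea, here is a minimal sketch of how selective amplification changes the makeup of a sample. It is an idealized model rather than the MISL protocol: it assumes only the file of interest is amplified, that every cycle multiplies those molecules by the same factor, and that everything else stays at its starting count. The function name and parameters are hypothetical.

```python
def target_fraction(target_copies: float, background_copies: float,
                    cycles: int, efficiency: float = 1.0) -> float:
    """Fraction of all molecules that belong to the file of interest after PCR.

    Idealized assumptions (not taken from the paper):
    - only target molecules are amplified,
    - each cycle multiplies the target count by (1 + efficiency),
    - background molecules are neither amplified nor lost.
    """
    amplified = target_copies * (1.0 + efficiency) ** cycles
    return amplified / (amplified + background_copies)


if __name__ == "__main__":
    # One page of interest hidden among a million pages in the backpack.
    for cycles in (0, 10, 20, 30):
        fraction = target_fraction(1, 999_999, cycles)
        print(f"{cycles:2d} cycles -> target fraction {fraction:.6f}")
```

Under these assumptions the pages you care about go from one in a million to the overwhelming majority after a few dozen cycles, which is why the sequencer ends up spending most of its reads on the file you asked for.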
To be clear: there are several different DNA data storage methods and not all of them utilize random access, and some perform random access in different ways. There are pros and cons to each method and the findings presented here are specific to the method presented in Figure 1.
In our work, we stress-tested our DNA data storage method's physical limits by doing two things, outlined in Figure 2:
- We reduced the copy number (number of copies of each DNA sequence) by simply diluting our sample in water (blue tubes in Figure 2). In the Library of Congress analogy, this is figuring out how many copies of each book you need to start off with in order to make sure you grab enough pages from that book when you put a small sample of pages into your backpack.
- We combined our files with a bunch of unique/random sequences of DNA into one PCR reaction (random access), simulating increased storage complexity with a data density of 150GB of data per microliter (the orange tubes in Figure 2). In the Library of Congress analogy, this is like adding a ton of extra pages to sort through.
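To build intuition for the dilution step, here is a back-of-the-envelope sketch of what lowering the copy number does. It assumes that the number of copies of any one sequence landing in a sample follows a Poisson distribution around the mean copy number. That is a common simplifying assumption for random sampling of molecules, not the analysis used in the paper, and the function names are hypothetical.

```python
import math


def diluted_mean_copies(starting_copies: float, dilution_factor: float) -> float:
    """Mean copies per sequence after diluting the pool by dilution_factor."""
    return starting_copies / dilution_factor


def expected_missing_fraction(mean_copies: float) -> float:
    """Expected fraction of sequences with zero copies in the sample.

    Assumption: copies per sequence are Poisson distributed, so the chance
    of drawing none is exp(-mean_copies).
    """
    return math.exp(-mean_copies)


if __name__ == "__main__":
    starting = 200  # roughly the per-microliter copy number of the original pool
    for factor in (1, 20, 200, 600):
        mean = diluted_mean_copies(starting, factor)
        missing = expected_missing_fraction(mean)
        print(f"dilute {factor:3d}x -> mean {mean:6.2f} copies, "
              f"expected missing fraction {missing:.2e}")
```

Under this model, a mean of 10 copies leaves fewer than one sequence in ten thousand missing, while a mean of 1 copy leaves roughly 37% of sequences missing. In the analogy, that is the difference between almost always grabbing at least one copy of each page and regularly coming home without some of them.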
Figure 2. (a) The random access stage within the DNA data storage workflow. (b) Left. We started with a pool of DNA with nine files and decided to access just three of them, each of different sizes (in green). In 1 microliter of the original pool, each sequence of DNA was present about 200 times. (b) Right. Aliquots of that pool were then diluted to reduce the file copy number, either in water (blue tubes) or complex conditions with lots of extra unique DNA sequences (orange tubes). As an aside, making those simple tubes in Inkscape was surprisingly difficult for me, and I suspect all my future figures requiring tubes will feature these exact same ones.
Our intuition was that these complex storage conditions would noticeably hinder our ability to recover our files relative to the simple water-diluted conditions, perhaps not even allowing the smallest file we tested to amplify. We also assumed that a significant number of DNA sequences would be missing once the copy number dropped below about 100 copies per PCR reaction.
The humbling beauty of science proved us wrong, on both counts.
In early 2017, we tested copy numbers of roughly 7,000, 700, 70, and 7. The trend was clear: the smaller the copy number, the more sequences were missing. However, the inflection point of this relationship was right between the last two data points. More data points would have to be examined in that copy number region to figure out what exactly was going on, but we didn't expect that to take long.
It took another few months to redo the experiment and improve the wet lab methods so that dilution was more consistent, but there was still the problem of trying to pinpoint exactly what our copy numbers were.
Determining precise copy numbers and calculating the confidence of each one of those values was surprisingly difficult in this experiment. We wanted to be as precise as possible, and that meant using qPCR. The problem was that the copy numbers were so low in our diluted samples (as you can see in Figure 2, we ended up focusing on copy numbers of roughly 30 to 0.3) and the files accessed were so small that for several of the samples the qPCR machine couldn't distinguish between our real files and the empty negative control. It took several iterations to solve this problem and arrive at the final process presented in the paper.
By late July of 2017, the experiment was completed for the third and final time. By December of 2017, we could confidently say the following:
- We have not yet reached the DNA storage complexity limit at 150 GB/microliter.
- We can successfully decode all three files down to an average copy number of 10.
- Sequences are lost stochastically.
- If you are stuck sequencing a sample with low copy number, increasing your sequencing coverage is not going to help much. (To be honest, this interesting tidbit was discovered because we accidentally sequenced the small file with a coverage hundreds of times greater than we meant to in one of the sequencing runs.)
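The last two findings fit together, and a toy simulation can show why. The sketch below is not the analysis from the paper. It assumes that each sequence's copy count in the sample is Poisson distributed, that amplification preserves relative abundances, and that reads are drawn at random in proportion to those abundances. All names and parameters are hypothetical.

```python
import math
import random


def poisson_sample(rng: random.Random, mean: float) -> int:
    """Draw one Poisson-distributed integer (Knuth's method, fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_recovery(n_sequences: int, mean_copies: float,
                      coverage: float, seed: int = 0) -> tuple[int, int]:
    """Return (sequences present in the sample, sequences seen by the sequencer)."""
    rng = random.Random(seed)
    # Step 1: sampling. Some sequences get zero copies purely by chance.
    copies = [poisson_sample(rng, mean_copies) for _ in range(n_sequences)]
    present = [i for i, c in enumerate(copies) if c > 0]
    if not present:
        return 0, 0
    # Step 2: sequencing. Reads can only come from sequences that are present.
    weights = [copies[i] for i in present]
    n_reads = int(coverage * n_sequences)
    reads = rng.choices(present, weights=weights, k=n_reads)
    return len(present), len(set(reads))


if __name__ == "__main__":
    for coverage in (5, 50):
        present, seen = simulate_recovery(500, mean_copies=1.0, coverage=coverage)
        print(f"coverage {coverage:3d}x: {present} of 500 present, {seen} recovered")
```

Because the same seed is used for both runs, the set of sequences present in the sample is identical, and only the sequencing depth changes. More reads can close the gap between "present" and "recovered", but they can never recover a sequence that was lost before sequencing began.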
The field of DNA data storage has grown so much in the years since we started working on this paper, and it's been a very exciting time. With experiments like this one continuing to lend support to the hypothesis that DNA data storage is dense and reliable, we continue to be optimistic about its future. While DNA synthesis (and to a lesser extent, sequencing) is still too expensive for DNA data storage to be employed at a large scale, the enthusiasm around it has not waned, as evidenced by the increasing number of academic papers focused on all parts of the DNA data storage pipeline and by non-academic articles such as "9 DNA Data Storage Companies to Watch".
Despite the substantial body of work that exists on DNA data storage, plenty of engineering work still needs to be done at large scale to scope out practical implementations. For instance, should we group or split files to form uniform blocks? If so, what size should those blocks be to incorporate the most data with the least amount of overhead? In practice, what sorts of files do people even want to store for long-term data storage? There are a myriad of other interesting questions to investigate at the intersection of multiple fields, and, in our opinion, this is a large part of what makes DNA data storage such a fun and exciting area to work in.
Banner photo credit: Tara Brown Photography / UW
Our paper: Organick, L., Chen, Y., Dumas Ang, S., Lopez, R., Liu, X., Strauss, K., and Ceze, L. Probing the physical limits of reliable DNA data retrieval. Nat Commun 11, 616 (2020). https://doi.org/10.1038/s41467-020-14319-8
Erlich, Y. and Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 6328 (2017). https://doi.org/10.1126/science.aaj2038
Tomek, K. J., Volkel, K., Simpson, A., Hass, A. G., Indermaur, E. W., Tuck, J. M., and Keung, A. J. Driving the scalability of DNA-based information storage systems. ACS Synth Biol 8, 6 (2019). https://doi.org/10.1021/acssynbio.9b00100