A DNA synthesis and decoding strategy tailored for storing and retrieving digital information
By Eriona Hysolli
(BOSTON) – A team at Harvard’s Wyss Institute for Biologically Inspired Engineering and Harvard Medical School (HMS) has developed a low-cost DNA storage technique that enables encoding digital information at large scale. The approach was published recently in Nature Communications.
Information storage has evolved from stone to parchment, paper, tape, hard drives, CDs/DVDs, and flash drives, to name the main archiving media. While demand for storage has grown dramatically over time, these media still have drawbacks and limitations, such as limited longevity and low storage density. As we generate exponentially more data, it becomes increasingly essential to find new ways to overcome these limitations and preserve data safely, with guaranteed access, for the future.
George Church, Ph.D., who is a Core Faculty member at the Wyss Institute and Professor of Genetics at Harvard Medical School and of Health Sciences and Technology at Harvard and the Massachusetts Institute of Technology (MIT), pioneered the idea of using short synthetic DNA as a long-term information storage medium. His team first converted a complete book, comprising 5.27 megabits of text and images, into binary digital code, which they then encoded in DNA, and finally decoded again using next-generation sequencing technology. It is estimated that 1 gram of DNA can hold up to ~215 petabytes (1 petabyte = 1 million gigabytes) of information, although this number fluctuates as different research teams break new ground in testing the upper storage limit of DNA. Assuming the movie Avengers: Endgame in 720p HD takes up 6 gigabytes, you could store ~36 million copies of it in one gram of DNA. Global digital data is expected to grow to 175 zettabytes (1 zettabyte = 1 million petabytes) by 2025, and because all the digital data in the world could theoretically be stored in roughly 814 kg of DNA, DNA storage is being actively pursued as a compelling storage medium for the future.
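A quick back-of-the-envelope check of the figures above, using only the numbers quoted in this article (the ~215 PB/gram estimate and the assumed 6 GB movie size), can be written out as:

```python
# Sanity-check of the storage figures quoted in the article.
# All inputs are the article's own approximate numbers.

PETABYTE_GB = 1_000_000          # 1 petabyte = 1 million gigabytes
ZETTABYTE_PB = 1_000_000         # 1 zettabyte = 1 million petabytes

dna_capacity_pb_per_gram = 215   # ~215 PB of data per gram of DNA
movie_gb = 6                     # assumed size of a 720p HD movie

# How many 6 GB movies fit in one gram of DNA
copies_per_gram = dna_capacity_pb_per_gram * PETABYTE_GB / movie_gb
print(f"~{copies_per_gram / 1e6:.0f} million movie copies per gram")

# Mass of DNA needed for the projected 175 ZB of global data in 2025
global_data_pb = 175 * ZETTABYTE_PB
grams_needed = global_data_pb / dna_capacity_pb_per_gram
print(f"~{grams_needed / 1000:.0f} kg of DNA for all global data")
```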
Several properties make DNA appealing for data storage: 1) high density – you can store a lot of data in a tiny amount of mass; 2) stability – dried DNA stored in a cool environment can last thousands of years (the oldest DNA sequenced to date is about 700,000 years old); 3) energy efficiency – storing DNA requires little energy and space, just enough to keep it cool; 4) relevance – biological systems use DNA, so DNA, with its possibilities for encoding and decoding information, will not become obsolete like many other data storage media. There are, however, drawbacks that at present limit DNA as a universal archiving medium. While both synthesis and sequencing costs are at an all-time low, the costs of synthesizing DNA and then retrieving the data via next-generation sequencing are still much higher than those of conventional storage media. “In our lab, we continuously develop new methods for biology, medicine, and technology, but the key is to make them multiplexable. DNA storage holds great potential, and our model is faster, cheaper, and at scale,” said Church.
In this study, which was led by Henry Lee, Ph.D., a Postdoctoral Fellow in the Church lab, the researchers developed a DNA storage method that utilizes template-independent de novo enzymatic DNA synthesis to generate many short pieces of DNA without the need for a preexisting strand of DNA. The information is encoded in “trits” rather than binary code: each trit is a transition from one base to another (e.g., A→C or G→T), denoted 0, 1, or 2, instead of mapping digital information onto the four bases directly with the two-bit codes 00, 01, 10, and 11. This approach allows repeats of the same base in the elongating DNA strand without loss of information, since only transitions from one base to another are decoded. To ensure that the DNA fragments are of relatively similar lengths, the team added an enzyme that degrades DNA and stops chains from elongating beyond the desired average length. This is important for obtaining consistent pools of DNA that are read with similar efficiency when it is time to decode the data. Furthermore, the team used nanopore sequencing, which allows for user-defined allocation of sequencing depth and data recovery. This reduces reagent use and data volume, yielding the same information at a fraction of the cost.
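The transition idea can be sketched in a few lines of code. The following is a minimal illustration of trit encoding, not the authors' exact codec: each trit 0/1/2 selects one of the three bases that differ from the current base, so homopolymer runs added during synthesis carry no information and are simply skipped on decoding. The trit-to-base mapping (alphabetical order of the remaining bases) is an assumption made for illustration.

```python
# Minimal sketch of transition-based "trit" encoding (illustrative only).
# Each trit picks one of the three bases different from the current base,
# so runs of the same base can be ignored when decoding.

BASES = "ACGT"

def encode(trits, start="A"):
    """Map a sequence of trits (0, 1, 2) to a DNA string."""
    seq = [start]
    for t in trits:
        # The three bases other than the current one, in a fixed order
        choices = [b for b in BASES if b != seq[-1]]
        seq.append(choices[t])
    return "".join(seq)

def decode(seq):
    """Recover trits from base-to-base transitions, skipping repeats."""
    trits = []
    prev = seq[0]
    for base in seq[1:]:
        if base == prev:        # homopolymer repeat carries no information
            continue
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits

message = [2, 0, 1, 1, 2, 0]
dna = encode(message)                      # -> "ATAGCTA"
assert decode(dna) == message

# Doubling every base (a simulated synthesis run) leaves the message intact:
noisy = "".join(b * 2 for b in dna)
assert decode(noisy) == message
```

This is why the scheme tolerates imprecise run lengths from enzymatic synthesis: only the identity of each transition matters, not how many times a base is repeated.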
Both de novo enzymatic DNA synthesis and nanopore sequencing could help reduce the costs associated with using DNA to store digital information in practice. “What we did here was develop new technology specific for DNA storage. DNA synthesis and sequencing accuracy do not have to be as stringent as in biological applications,” explained Lee. “In addition to enhancements in DNA synthesis, we leveraged error-correcting codecs widely used for the internet, cell phones, and other applications to decode the digital information,” he added. In short, with the right codec, data can be perfectly retrieved from a pool of DNA fragments despite synthesis and sequencing error rates of up to 30%. This strategy cuts in-house sequencing and synthesis costs. A further cost-reducing byproduct of this all-in-one platform is the elimination of the third-party synthetic DNA providers that existing commercial DNA storage platforms rely on.
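The study uses production-grade error-correcting codecs of the kind deployed in networking and telephony; the toy sketch below only illustrates the underlying redundancy principle with the simplest possible scheme, a per-position majority vote across repeated noisy reads of the same fragment. The fragment sequence and the error pattern are hypothetical, chosen so that each read carries errors at a quarter of its positions.

```python
# Toy illustration of error tolerance via redundancy (NOT the study's codec):
# recover a fragment by per-position majority vote across corrupted reads.
from collections import Counter

BASES = "ACGT"
original = "ATAGCTAGGTCA"   # hypothetical fragment, not from the study

def corrupt(seq, positions):
    """Deterministically substitute the base at each listed position."""
    return "".join(
        BASES[(BASES.index(b) + 1) % 4] if i in positions else b
        for i, b in enumerate(seq)
    )

def majority_decode(reads):
    """Recover each position by majority vote across aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

# Ten simulated reads, each with substitution errors at 3 of 12 positions,
# with a different error set per read (mimicking high per-read error rates).
reads = [
    corrupt(original, {(r + k) % len(original) for k in (0, 4, 8)})
    for r in range(10)
]
assert all(read != original for read in reads)   # every single read is wrong
assert majority_decode(reads) == original        # yet the consensus is exact
```

Real codecs (fountain, convolutional, and related codes) recover data far more efficiently than naive repetition, but the principle is the same: enough structured redundancy lets perfect data emerge from individually unreliable molecules.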
The DNA fragments generated by the template-independent polymerase, known as TdT (terminal deoxynucleotidyl transferase), are shorter than those generated by the more commonly used phosphoramidite chemistry, and researchers still have to determine how this limits the volume of information that can be encoded. Studies thus far have reported only small-scale encoding (hundreds of bits of information) in pools of short DNA fragments. There is growing interest in enzymatic approaches to de novo DNA synthesis that produce longer fragments, which could increase the volume of information stored. To date, however, enzyme-based synthesis of longer fragments carries significant error rates. This study has shown that such synthesis and sequencing error rates are manageable, and in future work, the team will further optimize their technology to accommodate longer fragments.
Other first co-authors on the study are HMS postdoctoral fellow Reza Kalhor, Ph.D., and Technicolor Research and Innovation Lab researcher Naveen Goela, Ph.D.; the study was also co-authored by Jean Bolot, Ph.D., of the Technicolor Research and Innovation Lab. This work was supported by funding from National Institutes of Health Grant R01-MH103910-02, Department of Energy Grant DE-FG02-02ER63445, AWS Cloud Credits for Research, and Harvard’s Wyss Institute for Biologically Inspired Engineering.