Translation software allows you to efficiently store large amounts of data in DNA molecules

DNA offers a compact way to store huge amounts of data at low cost. Los Alamos National Laboratory has developed ADS Codex, which translates the 0s and 1s of digital computer files into the four-letter code of DNA.

ADS Codex translates binary data into nucleotides so that files can be stored inside molecules and retrieved later, offering potential cost savings and compact “cold storage.”

In support of a major collaborative project to store massive amounts of data in DNA molecules, a team led by Los Alamos National Laboratory has developed a key enabling technology that translates digital binary files into the four-letter genetic alphabet needed for molecular storage.

“Our software, the Adaptive DNA Storage Codec (ADS Codex), translates data files from what a computer understands into what biology understands,” said Latchesar Ionkov, a computer scientist at Los Alamos and principal investigator on the project. “It’s like translating from English to Chinese, only harder.”

The work is a key part of the Intelligence Advanced Research Projects Activity (IARPA) Molecular Information Storage (MIST) program, which aims to bring cheaper, larger, and longer-lasting storage to big-data operations in government and the private sector. MIST’s short-term goal is to write 1 terabyte (a trillion bytes) and read 10 terabytes within 24 hours for $1,000. Other teams are refining the writing (DNA synthesis) and retrieval (DNA sequencing) components of the initiative, while Los Alamos is working on coding and decoding.

“DNA storage has very long data retention and very high data density, which could disrupt the way we think about archival storage,” said Bradley Settlemyer, a storage systems researcher and systems programmer specializing in high-performance computing at Los Alamos. “You could store all of YouTube in your refrigerator instead of in acres and acres of data centers. But researchers first have to clear a few daunting technological hurdles related to integrating different technologies.”

Not lost in translation

Compared to the traditional long-term storage method, which uses pizza-sized reels of magnetic tape, DNA storage is potentially cheaper, far more compact, more energy efficient, and longer lasting. DNA survives for hundreds of years and requires no maintenance. Files stored in DNA can also be copied very easily and at very low cost.

The storage density of DNA is astounding. Consider this: humanity will generate an estimated 33 zettabytes of data by 2025, that is, 33 followed by 21 zeros. All of that information would fit comfortably inside a ping-pong ball. The Library of Congress holds about 74 terabytes, or 74 million million bytes, of information; 6,000 such libraries would fit in a DNA archive the size of a poppy seed. Facebook’s 300 petabytes (300,000 terabytes) could be stored in half a poppy seed.

A binary file is encoded into a molecule by DNA synthesis. Synthesis, a fairly well understood technique, arranges the building blocks of DNA into various orders, which are represented as sequences of the letters A, C, G, and T. These letters are the basis of all DNA code and provide the instructions for building every living thing on Earth.

The Los Alamos team’s ADS Codex specifies exactly how to translate binary data (all 0s and 1s) into sequences of the four letters A, C, G, and T. The Codex also handles decoding back into binary. DNA can be synthesized by several methods, and ADS Codex can accommodate them all. The Los Alamos team has completed version 1.0 of ADS Codex and plans to use it in November 2021 to evaluate the storage and retrieval systems developed by the other MIST teams.
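To make the idea of translating 0s and 1s into four letters concrete, here is a minimal Python sketch. It assumes the simplest possible mapping, two bits per nucleotide; the article does not describe the actual encoding ADS Codex uses, and the names `encode` and `decode` are purely illustrative.

```python
# Illustrative only: a naive 2-bits-per-nucleotide mapping.
# This is NOT the ADS Codex scheme, which the article does not describe.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Translate bytes into a string of A, C, G, and T."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Translate a string of A, C, G, and T back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

A real codec has to do much more than this, because a mapping this simple offers no protection against synthesis errors.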

Unfortunately, DNA synthesis sometimes makes mistakes in the coding, so ADS Codex addresses two major obstacles to creating DNA data files.

First, error rates when writing to molecular storage are very high compared to traditional digital systems, so the team had to devise new strategies for error correction. Second, errors in DNA storage arise from different sources than in the digital world, which makes them harder to correct.

“On a digital hard disk, binary errors occur when a 0 flips to a 1, or vice versa, but with DNA you have additional problems that come from insertion and deletion errors,” Ionkov said. “You’re writing A, C, G, and T, but sometimes you try to write A and nothing appears, so the sequence of letters shifts to the left, or it types AAA. Normal error correction codes don’t work well with that.”

ADS Codex adds extra information, called error detection codes, that can be used to validate the data. When the software converts the data back to binary, it tests whether the codes match. If they don’t, ADS Codex tries removing or adding nucleotides until the validation succeeds.
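The validate-then-retry idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the ADS Codex algorithm: it uses a CRC-32 checksum (via the standard `zlib` module) as the error detection code, assumes the expected strand length is known, and handles only a single inserted or deleted nucleotide. The names `protect`, `is_valid`, and `repair` are hypothetical.

```python
import zlib
from typing import Optional

BASES = "ACGT"
CHECK_LEN = 16  # a 32-bit CRC written at 2 bits per nucleotide


def checksum_bases(payload: str) -> str:
    """Express the CRC-32 of the payload as 16 nucleotides."""
    crc = zlib.crc32(payload.encode("ascii"))
    return "".join(BASES[(crc >> shift) & 0b11] for shift in range(30, -2, -2))


def protect(payload: str) -> str:
    """Append the error detection code to the payload."""
    return payload + checksum_bases(payload)


def is_valid(strand: str) -> bool:
    """Check whether the trailing code matches the rest of the strand."""
    if len(strand) <= CHECK_LEN:
        return False
    payload, check = strand[:-CHECK_LEN], strand[-CHECK_LEN:]
    return checksum_bases(payload) == check


def repair(received: str, expected_len: int) -> Optional[str]:
    """Try removing or adding one nucleotide until validation succeeds."""
    if len(received) == expected_len:
        return received if is_valid(received) else None
    if len(received) == expected_len + 1:    # an extra base was inserted
        candidates = (received[:i] + received[i + 1:]
                      for i in range(len(received)))
    elif len(received) == expected_len - 1:  # a base was dropped
        candidates = (received[:i] + base + received[i:]
                      for i in range(len(received) + 1) for base in BASES)
    else:
        return None  # more than one insertion or deletion: out of scope here
    return next((c for c in candidates if is_valid(c)), None)


original = protect("ACGTTGCAAGGCTA")
dropped = original[:3] + original[4:]                    # one base never written
stuttered = original[:5] + original[4] + original[5:]    # one base written twice
print(repair(dropped, len(original)) == original)
print(repair(stuttered, len(original)) == original)
```

Brute-force search like this grows quickly with the number of errors, which is one reason a production codec would need a more careful strategy.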

Smart scaling

Today’s largest data centers fill large warehouses, with storage at the exabyte scale, that is, a trillion million bytes or more. Such digital data centers cost billions of dollars to build, power, and run, and may not be the best option as the need for data storage continues to grow exponentially.

Long-term storage on cheaper media is important for the national security mission of Los Alamos and other institutions. “At Los Alamos, we have some of the oldest digital data and largest data stores, dating back to the 1940s,” Settlemyer said. “It still has tremendous value. Because we keep data forever, we’ve been at the forefront for a long time when it comes to finding a cold-storage solution.”

Settlemyer said DNA storage could be a disruptive technology because it spans several fields that are ripe for innovation. The MIST project is inspiring a new coalition among legacy storage vendors that make tape, DNA synthesis companies, DNA sequencing companies, and high-performance computing organizations like Los Alamos, which are pushing computers to ever-larger scales of scientific simulation that produce staggering amounts of data to analyze.

Dig deeper into DNA

When most people think of DNA, they think of life, not computers. However, DNA itself is a four-letter code that conveys information about living things. DNA molecules are made up of four bases or nucleotides, adenine (A), thymine (T), guanine (G) and cytosine (C), each identified by a letter.

These bases wind around each other in a twisted chain (the familiar double helix) to form the molecule. The arrangement of these letters into sequences creates a code that tells an organism how to form. The complete set of DNA molecules makes up the genome, the blueprint of your body.

By synthesizing DNA molecules, that is, making them from scratch, researchers have found that they can specify, or write, long strings of the letters A, C, G, and T and then read those sequences back. The process is analogous to how a computer stores information using 0s and 1s. The method has been proven to work, but reading and writing DNA-encoded files currently takes a long time, Ionkov said.

“Appending a single nucleotide to DNA is very slow. It takes a minute,” Ionkov said. “Imagine writing a file to a hard drive taking more than a decade. So that problem is solved by going massively parallel. You write tens of millions of molecules simultaneously to speed it up.”
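A rough back-of-the-envelope calculation shows why parallelism matters. The sketch below takes the one-minute-per-nucleotide figure from the quote and assumes, for illustration only, that each nucleotide carries two bits and that ten million molecules are written at once; the function name `years_to_write` is hypothetical.

```python
MINUTES_PER_NUCLEOTIDE = 1          # figure quoted by Ionkov
BITS_PER_NUCLEOTIDE = 2             # assumption: naive 2-bit mapping
MINUTES_PER_YEAR = 60 * 24 * 365


def years_to_write(num_bytes: int, parallel_molecules: int = 1) -> float:
    """Estimate the write time in years for a file of num_bytes."""
    nucleotides = num_bytes * 8 / BITS_PER_NUCLEOTIDE
    minutes = nucleotides * MINUTES_PER_NUCLEOTIDE / parallel_molecules
    return minutes / MINUTES_PER_YEAR


two_megabytes = 2_000_000
serial = years_to_write(two_megabytes)
parallel = years_to_write(two_megabytes, parallel_molecules=10_000_000)
print(f"one molecule at a time: {serial:.1f} years")
print(f"ten million in parallel: {parallel * MINUTES_PER_YEAR:.1f} minutes")
```

Under these assumptions, a file of only a couple of megabytes written one nucleotide at a time would take more than a decade, while spreading the same work over ten million molecules brings it down to about a minute.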

While different companies are working on different synthesis methods to solve this problem, ADS Codex can adapt to any approach.

Funding for ADS Codex is provided by the Intelligence Advanced Research Projects Activity (IARPA), a research agency within the Office of the Director of National Intelligence.
