As big data keeps growing at an exponential rate, the need to find more compact storage solutions is becoming more pressing. Some researchers believe the solution may lie in DNA, which can condense huge amounts of data into microscopic spaces.
The size of data's future
As a reminder of just how big big data is: in 2012, the Harvard Business Review estimated that "2.5 exabytes of data are created each day," which is 2.5 billion gigabytes. Just one second of data transfer on the internet today exceeds what was to be found "in the entire internet just 20 years ago."
Just one business like Walmart can take in "more than 2.5 petabytes of data every hour from its customer transactions." If all that data were in paper, it would fill "about 20 million filing cabinets." While electronic formats are more compact than that, they still take up a significant amount of space.
The DNA solution to data storage made a splash this month when a team of researchers from EMBL-EBI (European Bioinformatics Institute) worked with Agilent to produce a retrievable sequence of synthesized DNA. It contains all of Shakespeare’s sonnets, an mp3 file of Martin Luther King, Jr.’s "I have a dream" speech, a picture file of EMBL-EBI, a PDF of Watson and Crick’s "Molecular structure of nucleic acids," and a file describing the system of encoding.
After confirming that DNA sequencing read the data back accurately, the researchers submitted their findings to Nature last spring; the paper was published on January 23, 2013. They presented it as a viable solution to digital storage for inactive files: "Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving."
Longlife data
One of the researchers, Dr. Nick Goldman, extolled this form of data storage for its stability, estimating its lifespan to extend to at least 10,000 years. That’s in marked contrast to current options for "no-power" data storage: magnetic tape rarely survives even 10 years.
In addition to the advantages of long-term stability with little maintenance, this form of DNA storage offers a new system of encoding that is more efficient than traditional binary code.
The encoding they used substitutes the letters G, T, C, and A, which represent the four bases of DNA, for the 0s and 1s of binary code. The result is a much more compact representation of information. For example, as reported in the Guardian, it takes eight binary digits to represent the letter T; in the DNA code, it is represented by five letters, TAGAT. Consequently, a single gram of DNA could store more data than a million compact discs.
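To make the idea of a non-binary DNA code concrete, here is a minimal sketch in Python. It assumes a base-3 code in which each letter is chosen from the three bases that differ from the letter before it, so the same letter never appears twice in a row. The fixed six-digit width per byte, the starting letter, and the order of the lookup are all assumptions made for illustration. This is not the researchers' actual code table, so it will not reproduce TAGAT for the letter T.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: fixed width; 3**6 = 729 covers all 256 byte values


def byte_to_trits(value):
    """Write one byte as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        value, remainder = divmod(value, 3)
        trits.append(remainder)
    return trits[::-1]


def encode(data, previous="A"):
    """Turn bytes into DNA letters; each letter differs from the one before it."""
    letters = []
    for byte in data:
        for trit in byte_to_trits(byte):
            # Only the three bases that differ from the previous letter are allowed.
            choices = [b for b in BASES if b != previous]
            previous = choices[trit]
            letters.append(previous)
    return "".join(letters)


def decode(dna, previous="A"):
    """Reverse encode(): recover the base-3 digits, then rebuild the bytes."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)
```

Calling decode(encode(b"T")) returns b"T", and the encoded string for any input contains no letter repeated back to back.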
Ode to coding
The difference in coding is the real contribution of this team of researchers, for they are not the first ones to think about using DNA for data. In December 2011, Applied Physics Letters published the findings of scientists from Taiwan and Germany who combined electrodes, silver nanoparticles, and salmon DNA to create a "write-once-read-many-times (WORM) memory device," as reported in Gizmag. And in 2008, a research team succeeded in creating bacterial computers capable of data flipping and sorting.
Like the EMBL-EBI scientists, Harvard researchers succeeded in storing data in DNA, packing 5.27 million bits into a sequence. Their findings were published this past August in Science. Instead of using each of the four possible letters to form a new code, though, they adapted the DNA bases to the traditional binary pattern (as reported in New Scientist), with A and C each serving as a 0 and G and T each serving as a 1.
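The binary mapping described above (A or C for 0, G or T for 1) can be sketched in a few lines of Python. This is only an illustration of the mapping as reported, not the Harvard team's software. The function names are hypothetical, and the random choice between the two letters available for each bit is an assumption, since the article does not say how that choice was made.

```python
import random

ZERO_BASES = "AC"  # either letter stands for a 0 bit
ONE_BASES = "GT"   # either letter stands for a 1 bit


def text_to_bits(text):
    """Spell out the UTF-8 bytes of the text as a string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))


def bits_to_dna(bits, rng=None):
    """Replace each bit with one of its two allowed letters (assumed: picked at random)."""
    rng = rng or random.Random(0)
    return "".join(
        rng.choice(ONE_BASES if bit == "1" else ZERO_BASES) for bit in bits
    )


def dna_to_bits(dna):
    """Read the bits back: G or T is a 1, A or C is a 0."""
    bits = []
    for base in dna:
        if base in ONE_BASES:
            bits.append("1")
        elif base in ZERO_BASES:
            bits.append("0")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(bits)


def bits_to_text(bits):
    """Group the bits into bytes and decode them as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")
```

With this mapping the letter T, which is 01010100 in binary, still takes eight DNA letters, one per bit. Because two letters are available for every bit, many different sequences decode to the same data.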
DNA: Don't expect it soon
While DNA data storage is an exciting innovation, don’t expect to be able to convert all those bulky backup files into neat DNA sequences any time soon.
The high cost of DNA sequencing makes it far too expensive to serve as a storage solution for now. And even though the researchers are optimistic that it will become more affordable in the next 10 years, there are two other considerable drawbacks to this form of data storage -- as reported in Time:
The files cannot be updated; they would have to be set up into a new sequence for any modification.
Each file has to be decoded in its entirety, as it doesn’t allow access to a single component.
Still, the brave new world of data storage is one worth exploring. It's possible that it can bring about a paradigm shift in encoding data. Looking at alternatives to binary code may yield breakthroughs that can work even in other forms of programming for more flexible and efficient systems, even without DNA components.
Ariella,
User Rank: Blogger 2/9/2013 | 10:33:18 PM
Re: Interesting but.... @Saul I wonder if it is well-suited to images. As paintings decay over time, getting a high resolution photograph stored in a way that keeps it fresh and accurate over 10,000 years may be useful as a check on what the original shades and hues were, particularly if restoration is needed and we wish to avoid a repeat of the botched fresco.
Saul Sherry,
User Rank: Blogger 2/9/2013 | 7:07:50 PM
Re: Interesting but.... @legalcio... I can see that on monetary terms, the answer to your tweeting question would probably be a straight up no. But in terms of interest, it would be fascinating to have access to this repository of human idea regurgitation, and map how it changes past generations (looking at whatever platform comes to be the 'next' twitter). Interesting in an anthropological setting, and possible for a broader brand significance study. But will it matter to a bank looking to retain customers, or a doctor looking to cure a patient... no. And therefore, no investment will be justified.
Ariella,
User Rank: Blogger 2/8/2013 | 10:12:19 AM
Re: Interesting but.... @Susan At the rate we're going, it looks like data will continue to increase, as more and more information is tracked. Walmart's data likely includes what people searched for, as well as what they ultimately bought, when they shopped online. It also would track what types of devices they are using to access the site and how long and how often they visit. With smartphone sensors, even people's movements can be tracked just by virtue of Wi-Fi signals. So every step you take can contribute to the mounting piles of data.
DNA data storage offers the dual advantages of being incredibly compact and robust -- with an estimated life span of one thousand times that of current long-term storage options. Though it is not a feasible solution for now, thinking outside the box of binary code may lead to another solution that could be applied before a decade passes.
"The other thing that crosses my mind: is all this data worth storing for the long term? Does the shelf life for tweets really need to go beyond a year?"
No tweet deserves long-life storage. It would be silly to use an expensive storage option, or something like the disappointing DNA data storage that will not happen, for storing tweets.
However, it would be fantastic if someone could come up with a real solution, which actually could be applied and used to safely store important big data.
And the answer to my question about how and when DNA data storage would be available came immediately as I continued reading. :(
I was so excited thinking of all the possibilities and solutions this could bring that I couldn't wait to finish reading, and had to comment in the middle of having my thoughts.
Now I am so disappointed to learn that DNA data storage is simply not going to happen. Even if the cost becomes less expensive, if the files can't be updated it would be like having the same problem you have with paper storage: you simply have to create a new data entry. What's the point?
This was good for publicity, for them. Not a real contribution to anything. 10 years in today's agile world is the equivalent of 100 years. Who cares? We will all be dead by then. I am so upset that I can hardly choose my words here.
"The files cannot be updated; they would have to be set up into a new sequence for any modification.
Each file has to be decoded in its entirety, as it doesn't allow access to a single component."
Do they plan to continue working on this project? Do they expect to find a way to be able to modify the DNA sequence in order to allow updates?
Do they plan to find a way to be able to decode single components? What's next in their research plan? Do they even have a plan?
I think I read about Shakespeare's sonnets stored in a DNA sequence some time ago. Maybe when they started the project.
You say maybe this brings a shift in encoding data, at least. Maybe.
"this form of data storage for its stability, estimating its lifespan to extend to at least 10,000 years."
All this is super interesting.
So this could be the solution to storage, and the fears of losing data for one reason or another. If data can be safely stored for 10,000 years all our worries should be over, at least in this department.
I wonder how and when DNA data storage could be available for everyone who needs to store data.
"If all that data were in paper, it would fill 'about 20 million filing cabinets.'"
I can hardly imagine a room filled with 20 million filing cabinets. Even less can I imagine how someone could deal with such an amount of data in paper form.
Electronic formats can be handled more easily, but how much more is big data still going to grow?
Re: Interesting but.... @legalcio Maybe that's what the Library of Congress should do with its record of tweets -- store them in some DNA. But don't expect to see DNA around offices for data storage any time soon. Even the somewhat optimistic view of the researchers involved here is for it to happen in ten years.
For now the cost is prohibitive. But the expectation is that it will drop substantially over time. According to http://singularityhub.com/2012/09/17/new-software-makes-synthesizing-dna-as-easy-as-using-an-ipad/, the cost of DNA writing right now is about 25 cents per base pair. At that rate synthesizing an E. coli genome - a relatively small genome of 4.6 million base pairs - would still cost over $1M, too much for the average lab to pay, especially as they typically build up libraries of varied versions of the DNA. But the cost of DNA synthesis is dropping at a super-exponential rate, outpacing even Moore's Law. Amirav-Drory predicts it will hit 10 cents per base pair this year, and in the near future, when the cost of synthesis drops enough, gone will be the laborious days of manually splicing together pieces of DNA.
legalcio,
User Rank: Exabyte Executive 2/7/2013 | 1:27:40 PM
Interesting but.... The implications for Big Data, I assume, would be a whole different breed of data scientists. Would the storage venue be too complex to take advantage of Big Data? The other thing that crosses my mind: is all this data worth storing for the long term? Does the shelf life for tweets really need to go beyond a year?