COSMOS magazine


Jane Austen helps crack DNA code

Wednesday, 22 November 2006
Cosmos Online
The Ebola virus ... an algorithm based on the rules of language is being used to decipher its DNA.

Credit: CDC

SYDNEY: Cold War experience in breaking Russian codes and the structure of Jane Austen's Emma are being used to unravel the secrets of DNA.

Simon Shepherd, a former British Naval Intelligence computational mathematician now at the University of Bradford in the U.K., is using code-breaking algorithms to decipher the genome of the Ebola virus, according to a report in the British journal Nature.

"We are treating DNA as we used to treat problems in intelligence," said Shepherd. "We want to break the code at the most fundamental level."

In devising their algorithm, Shepherd and colleagues hypothesised that DNA is like a language, and should therefore obey the same fundamental rules that all languages follow, using words, sentences, punctuation and grammar.

"We thought that if we can get the algorithm to work on English, it might work on DNA," said Shepherd. "Other researchers in related work in the USA chose [Herman] Melville's Moby Dick which, with the greatest respect to Melville, is not the greatest piece of English literature."

For the test, the Bradford team chose the English literary classic Emma by Jane Austen, then removed all spaces and punctuation. So Emma's snobbery becomes "…thoughtitallextremelyshabbyandveryinferiortoherown…".
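The preparation step described above is simple to picture in code. What follows is only a minimal sketch in Python, not the Bradford team's own tooling: the function name is hypothetical, the spaced and punctuated form of the sample is reconstructed purely for illustration, and lower-casing the text is an assumption the article does not confirm.

```python
def strip_spaces_and_punctuation(text: str) -> str:
    """Keep only letters, dropping spaces, punctuation and digits.

    Lower-casing is an assumption; the article does not say how case was handled.
    """
    return "".join(ch.lower() for ch in text if ch.isalpha())


# Illustrative sample: the spacing and comma are reconstructed, not quoted from the novel.
sample = "thought it all extremely shabby, and very inferior to her own"
print(strip_spaces_and_punctuation(sample))
# prints: thoughtitallextremelyshabbyandveryinferiortoherown
```

The result is one unbroken run of letters, which is the kind of input the algorithm then has to split back into words.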

The algorithm identified 80 per cent of the words and separated them back into sentences, despite not having been programmed with any English vocabulary or grammar.
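The article does not say how the 80 per cent figure was measured. One plausible way to score such a result, sketched here in Python purely as an assumption, is to count how many of the original words come back with exactly the right start and end positions. The function names and the toy example are hypothetical.

```python
def word_spans(words):
    """Return the set of (start, end) character positions occupied by each word."""
    spans = set()
    pos = 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def fraction_of_words_recovered(true_words, predicted_words):
    """Fraction of true words whose exact boundaries appear in the prediction."""
    if "".join(true_words) != "".join(predicted_words):
        raise ValueError("both segmentations must cover the same letter sequence")
    true_spans = word_spans(true_words)
    if not true_spans:
        return 0.0
    return len(true_spans & word_spans(predicted_words)) / len(true_spans)


# Toy example: the prediction wrongly merges "to" and "her".
truth = ["very", "inferior", "to", "her", "own"]
guess = ["very", "inferior", "toher", "own"]
print(fraction_of_words_recovered(truth, guess))
# prints: 0.6
```

Comparing positions rather than spellings means a word only counts as found if both of its boundaries are placed correctly.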

After such a high rate of success in English literature, the team tried to decipher the seemingly alien language of DNA.

"We have run the entire Ebola virus genome and isolated a good number of biologically active regions," said Shepherd.

The team is interested in using the technique both to combat pathogens and to better understand human DNA.

While the Human Genome Project gave us the sequence of our DNA, understanding its language is far more difficult.

The DNA code comprises four letters - A, T, C and G - which represent adenine, thymine, cytosine and guanine, the four bases that make up genetic material. A combination of three of these letters specifies each amino acid, and chains of amino acids make up proteins which carry out our bodily functions.
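The three-letter rule can be shown with a short Python sketch. The table below is deliberately partial, listing only a handful of entries from the standard genetic code, and the function name and example sequence are made up for illustration.

```python
# A small excerpt of the standard genetic code (DNA coding-strand codons).
# The full table has 64 entries; this subset is for illustration only.
CODON_TABLE = {
    "ATG": "Met",
    "TTT": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "TGG": "Trp",
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
}


def translate(dna: str) -> list:
    """Read the sequence three letters at a time and look up each amino acid."""
    amino_acids = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        amino = CODON_TABLE.get(codon, "???")  # "???" marks codons missing from the excerpt
        if amino == "Stop":
            break
        amino_acids.append(amino)
    return amino_acids


# Hypothetical sequence, not taken from any real genome.
print(translate("ATGTTTGGCAAATGGTAA"))
# prints: ['Met', 'Phe', 'Gly', 'Lys', 'Trp']
```

Reading stops at the first stop codon, so the final "TAA" ends the chain rather than adding to it.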

However, DNA encodes more than protein sequences: it also carries instructions on when and where in the body's tissues proteins should be made, and on how the DNA itself is folded and regulated.

The team is concentrating on the signals in DNA that help to determine where it folds around its scaffold of structural proteins.

"The protein folding problem is regarded as one of the three grand-challenge problems of 21st century science," said Shepherd. "Its resolution is crucial to the development of the new drugs and medical therapies that the Human Genome Project promises one day to deliver."

Shepherd agrees that mathematics alone will not unlock the secrets of biology, and the project involves a multi-disciplinary team, including physicists, biologists, chemists and protein experts.

"The point is to search for tiny needles in enormous haystacks so as to have some idea of where to focus effort in the 'wet lab'," said Shepherd.