Saturday, June 12, 2010

Who Taught Genes to Speak "Genish"?

I started reading some Matt Ridley books in response to something posted on Grim's Hall a few weeks ago. The first one I picked up, "Genome," startled me with the casual mention of a fact I never had encountered in school or lay reading: that DNA uses a code for amino acids that appears to be completely abstract, like human language or computer code. I suppose I’d always assumed that the message carried from the chromosomes to the cell’s factory was a specially shaped molecule that molded to raw materials and shaped the intended amino acid by direct physical influence. In all my life I’d never run across the notion that any creature or thing other than sentient humans used fully abstract language. Machines can be made to use it when humans construct them that way. Some believe that birds, dolphins, whales, and primates come close to an abstract language. Dogs can certainly learn to associate arbitrary sounds with particular objects or actions. But I’d never heard a whisper about a true abstract code in use by inanimate molecules.

Obviously I’m no biochemist, but I do like to read in this area, and I’m generally familiar with the kind of electrifying concept that starts people producing popular science books and gets them shouting at each other about things like Intelligent Design. Why isn’t this concept the subject of more popular debate? The 3-“letter” codes that DNA assigns to amino acids were deciphered more than 50 years ago, by people who got Nobel Prizes for the work. The same code is used by virtually all life on earth (an interesting exception being the mitochondria within our cells, which use a different language, rather like finding a non-Indo-European outcrop among the Basques in the middle of Western Europe). It’s apparently an accepted, commonplace notion that the genetic code is abstract rather than stereochemical. That is, the molecular triplets bear no special spatial or chemical relation to the amino acids to which they are assigned; they are neither mirror-images nor the “keys” that fit a chemically shaped “lock.” Evidence even for a primordial version of RNA that relied more on molecular shapes than on an abstract code is ambiguous. It seems that genes speak a true, abstract, digital “Genish” language.

We all learned in school that DNA’s information is stored in long strands of words spelled with the four letters A, T, G, and C (representing the four bases, adenine, thymine, guanine, and cytosine). The simple but breathtaking cleverness of this scheme lies in the tendency of A to attach to T while C attaches to G in the rungs of the twisted DNA ladderlike strand. As a result, a sequence of A-T-C-G words on one half of the ladder corresponds to its mirror-image T-A-G-C on the other half. When unzipped, the ladder attracts its mirror image in messenger RNA, which can recreate the original sequences in another mirror image after completing its journey to the factory department of the cell. (Or maybe the factory reads it backwards, I don’t know.) So far, so good, but did you ever wonder how the cell’s factory “reads” the meaningful A-T-C-G sequence so that it can synthesize a particular protein from the specified amino acid recipe? It’s not hard to find a description of the physical process of moving down the A-T-C-G ticker tape from left to right and so on, but how does the factory know what the message means?
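For anyone who thinks better in code, the pairing rule just described can be sketched in a few lines of Python. This is only an illustration: the names `PAIRING` and `complement` are my own inventions, and the sketch ignores real-world details such as the direction in which each strand is read and the fact that RNA swaps in a different base for T.

```python
# The pairing rule from the paragraph above: A attaches to T, C attaches to G.
# PAIRING and complement are illustrative names, not from any library.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner of each base, in the same left-to-right order."""
    return "".join(PAIRING[base] for base in strand)


if __name__ == "__main__":
    one_half = "ATCG"
    other_half = complement(one_half)
    print(one_half, "pairs with", other_half)
    # Taking the mirror image twice recovers the original sequence.
    print(complement(other_half) == one_half)
```

The second check is the point of the scheme: because each letter has exactly one partner, a copy of a copy restores the original message.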

The ability of the A-T-C-G alphabet to encompass complex information is not in itself mysterious. Even simple pairs of letters in a four-letter alphabet will yield 4 to the 2nd power, or 16, words. Triplets of letters will yield 4 to the 3rd power, or 64, words, more than adequate to describe the 20 amino acids that make up the proteins used by all life on Earth. Words of even slightly greater length quickly make possible a vocabulary well up to the demands of highly complex messages. As it happens, each amino acid can be fully identified by a single A-T-C-G triplet. In protozoa, sharks, or people, the Genish word “CAA” spells the amino acid glutamine. Because there are 64 three-letter A-T-C-G words and only 20 commonly used amino acids, there is room for quite a few synonyms, and glutamine can also be spelled “CAG.” The amino acid arginine has six possible spellings: CGT, CGC, CGA, CGG, AGA, and AGG; other amino acids have different numbers of possible spellings.

But who is there at the receiving end to “read” these triplet-words, and how do they manage it? For, surprisingly, the messages delivered from your genes in Genish appear to be purely abstract. “CAA” spells glutamine, not because the glutamine molecule happens to have a shape that resembles the surface of a cytosine base followed by two adenine bases, but because life forms have adopted a coded system in which CAA has been assigned this abstract meaning.

It is mind-blowing that genes do not communicate with the cell’s factory workers stereochemically (by direct touch, shape, and charge) but by symbolic representation. The chasm between direct communication and communication by language is fantastically deep and mysterious. Replication by direct communication may be startling in its effects, but it is not difficult to grasp in its concept. Fire “communicates” its infectious pattern to its next source of fuel. The prions that cause Mad Cow disease spread their “message” by touching similar proteins and physically, electrically, or chemically inducing them to re-fold themselves in a new pattern that, unfortunately for animals and people, fatally disrupts their brain functions. (Prions therefore echo the fictional “Ice-9” substance popularized in Kurt Vonnegut’s “Cat’s Cradle,” in which a form of water that was crystalline at room temperature could “infect” ordinary water and render it hard as ice in a disastrous world-wide chain reaction.)

All these changes proceed by direct contact and physical influence, in stark contrast to communication by the abstract symbols of language. A child waiting to cross the street while a bus hurtles toward him may be stopped by his mother’s arm thrown protectively in front of him. But in order to be saved instead by a “Don’t Walk” sign, he must read and understand a message delivered via symbols that lack physical force.

So, as in the old joke about the thermos that keeps your lunch either hot or cold, “How does it know?” Even more interesting, how did genes learn to speak Genish in the first place?

Evolutionary biology answers most questions with the all-purpose mechanism of selective pressure. However powerful an explanation this provides for the adaptation and preferential survival of living organisms that already possess DNA, it lacks force in the question of how DNA, or even the more primitive RNA that may have been DNA’s prototype, could have begun to employ abstract codes to begin with. To a limited degree, small, simple strings of nucleotides (the A-T-C-G strings) or peptides can be observed spontaneously to replicate themselves in a test vial. It is far less clear how such strings could spontaneously begin instructing other molecules to synthesize proteins according to a recipe written in abstract code. It is not simply the old question of whether a monkey with a typewriter, given enough time, could produce “Hamlet.” It’s more as though the monkey produced English vocabulary and grammar.

It’s remarkable how little attention this question has received in popular publication. Most discussions begin with the primordial soup and progress to spontaneous production of sequences of molecular “letters.” But then they make the huge leap to a world in which each amino acid already has been assigned one or more 3-letter codes, after which the way is paved for the slow, inexorable process of natural selection on the combinations of amino acids that result from those codes.

Surely the spontaneous development of a complex, abstract language deserves more curiosity? People have been wondering and arguing over the development of language even in complex, intelligent primates for a very long time without coming up with definitive answers. We think hard about whether dolphins and whales can be said to have a language, and whether gorillas or computers can learn the trick. Why are we not astounded that molecules pulled it off nearly 4 billion years ago, before life had even begun to sort itself out into primitive cell-like structures, let alone into multicellular organisms, dinosaurs, or people?


  1. I see from later reading that I was way off base on half of this, but not on the other half. The mechanisms for the cell factories (ribosomes) to "read" the RNA codons are reasonably well understood and consist of traditional chemical messengers (tRNA molecules with a codon-shaped piece on one end and an amino-acid-shaped piece on the other). But the question of how the code could have developed in the first place is as mysterious as I imagined it was. The evidence for a stereochemical origin is weak so far.

    So I'll have to completely re-write this once I've read some more things and vanquished more ignorance.

  2. Looks like a potentially interesting blog (found you via Cassandra). How about some more posts?

  3. Since Grim has been kind enough to let me post at, I really was just using this space to practice formatting, and anything new is over there! Thanks for asking, though!

  4. "Evolutionary biology answers most questions with the all-purpose mechanism of selective pressure. However powerful an explanation this provides for the adaptation and preferential survival of living organisms that already possess DNA, it lacks force in the question of how DNA, or even the more primitive RNA that may have been DNA’s prototype, could have begun to employ abstract codes to begin with."

    Interestingly, it is this basic premise that caused the (former) atheist in charge of the Human Genome Project to question his (former) assumptions and become convinced that there must indeed be a Creator, A.K.A. God.

  5. That is interesting. I guess it's not that much different from me, but for me the thing I couldn't get past was the general orderliness of things, like the inverse square law. Of course that only gets you to a kind of vague Deism, so it's not the whole road.

  6. There is a scientific game called Fold-It where you "fold" proteins. Folding is when you sort a protein from its current state to its preferred state, or something. As you progress, you will delve into the ACTG thing. It is rather random instead of algorithmic. It got too hard for me, but based on your post, you may enjoy it.

  7. Thanks - that's interesting, since I've been reading about protein folding lately. There's a post at about it, starting with a kind of electronic self-folding origami gizmo that does a very simple form of the enormously complicated thing proteins pull off using only their shapes and valences.

    It does appear to me from recent reading that the communication between ATCG code and amino acids is pretty straightforward. There's no direct relationship between each amino acid and its triplet code, but there's an intermediary molecule that's amino-acid-shaped on one end and triplet-code-shaped on the other.

    The harder question is how that kind of translation system could have developed to start with, since when it began the evolutionary natural selection pressures weren't yet up and running. Nick Lane, among others, has interesting and reasonably plausible ideas about how that could have gotten going. At least he's curious about it, which I'm finding is a little unusual.

  8. I'd like to check up on what you thought of the DVDs, Texan.

    you can also contact me at ymarsakar at yahoo com