
The Origin of Biological Information and the Higher Taxonomic Categories

On August 4th, 2004 an extensive review essay by Dr. Stephen C. Meyer, Director of Discovery Institute’s Center for Science & Culture appeared in the Proceedings of the Biological Society of Washington (volume 117, no. 2, pp. 213-239). The Proceedings is a peer-reviewed biology journal published at the National Museum of Natural History at the Smithsonian Institution in Washington D.C.

In the article, entitled “The Origin of Biological Information and the Higher Taxonomic Categories”, Dr. Meyer argues that no current materialistic theory of evolution can account for the origin of the information necessary to build novel animal forms. He proposes intelligent design as an alternative explanation for the origin of biological information and the higher taxa.

Due to an unusual number of inquiries about the article, Dr. Meyer, the copyright holder, has decided to make the article available now in HTML format on this website. (Offprints are also available from Discovery Institute by writing to Rob Crowther at cscinfo@discovery.org. Please provide your mailing address and we will dispatch a copy.)

Introduction

In a recent volume of the Vienna Series in Theoretical Biology (2003), Gerd B. Muller and Stuart Newman argue that what they call the “origination of organismal form” remains an unsolved problem. In making this claim, Muller and Newman (2003:3-10) distinguish two issues, namely, (1) the causes of form generation in the individual organism during embryological development and (2) the causes responsible for the production of novel organismal forms in the first place during the history of life. To distinguish the latter case (phylogeny) from the former (ontogeny), Muller and Newman use the term “origination” to designate the causal processes by which biological form first arose during the evolution of life. They insist that “the molecular mechanisms that bring about biological form in modern day embryos should not be confused” with the causes responsible for the origin (or “origination”) of novel biological forms during the history of life (p. 3). They further argue that we know more about the causes of ontogenesis, due to advances in molecular biology, molecular genetics and developmental biology, than we do about the causes of phylogenesis, that is, the ultimate origination of new biological forms during the remote past.

In making this claim, Muller and Newman are careful to affirm that evolutionary biology has succeeded in explaining how preexisting forms diversify under the twin influences of natural selection and variation of genetic traits. Sophisticated mathematically-based models of population genetics have proven adequate for mapping and understanding quantitative variability and populational changes in organisms. Yet Muller and Newman insist that population genetics, and thus evolutionary biology, has not identified a specifically causal explanation for the origin of true morphological novelty during the history of life. Central to their concern is what they see as the inadequacy of the variation of genetic traits as a source of new form and structure. They note, following Darwin himself, that the sources of new form and structure must precede the action of natural selection (2003:3) — that selection must act on what already exists. Yet, in their view, the “genocentricity” and “incrementalism” of the neo-Darwinian mechanism has meant that an adequate source of new form and structure has yet to be identified by theoretical biologists. Instead, Muller and Newman see the need to identify epigenetic sources of morphological innovation during the evolution of life. In the meantime, however, they insist neo-Darwinism lacks any “theory of the generative” (p. 7).

As it happens, Muller and Newman are not alone in this judgment. In the last decade or so a host of scientific essays and books have questioned the efficacy of selection and mutation as a mechanism for generating morphological novelty, as even a brief literature survey will establish. Thomson (1992:107) expressed doubt that large-scale morphological changes could accumulate via minor phenotypic changes at the population genetic level. Miklos (1993:29) argued that neo-Darwinism fails to provide a mechanism that can produce large-scale innovations in form and complexity. Gilbert et al. (1996) attempted to develop a new theory of evolutionary mechanisms to supplement classical neo-Darwinism, which, they argued, could not adequately explain macroevolution. As they put it in a memorable summary of the situation: “starting in the 1970s, many biologists began questioning its (neo-Darwinism’s) adequacy in explaining evolution. Genetics might be adequate for explaining microevolution, but microevolutionary changes in gene frequency were not seen as able to turn a reptile into a mammal or to convert a fish into an amphibian. Microevolution looks at adaptations that concern the survival of the fittest, not the arrival of the fittest. As Goodwin (1995) points out, ‘the origin of species — Darwin’s problem — remains unsolved'” (p. 361). Though Gilbert et al. (1996) attempted to solve the problem of the origin of form by proposing a greater role for developmental genetics within an otherwise neo-Darwinian framework,1 numerous recent authors have continued to raise questions about the adequacy of that framework itself or about the problem of the origination of form generally (Webster & Goodwin 1996; Shubin & Marshall 2000; Erwin 2000; Conway Morris 2000, 2003b; Carroll 2000; Wagner 2001; Becker & Lonnig 2001; Stadler et al. 2001; Lonnig & Saedler 2002; Wagner & Stadler 2003; Valentine 2004:189-194).

What lies behind this skepticism? Is it warranted? Is a new and specifically causal theory needed to explain the origination of biological form?

This review will address these questions. It will do so by analyzing the problem of the origination of organismal form (and the corresponding emergence of higher taxa) from a particular theoretical standpoint. Specifically, it will treat the problem of the origination of the higher taxonomic groups as a manifestation of a deeper problem, namely, the problem of the origin of the information (whether genetic or epigenetic) that, as it will be argued, is necessary to generate morphological novelty.

In order to perform this analysis, and to make it relevant and tractable to systematists and paleontologists, this paper will examine a paradigmatic example of the origin of biological form and information during the history of life: the Cambrian explosion. During the Cambrian, many novel animal forms and body plans (representing new phyla, subphyla and classes) arose in a geologically brief period of time. The following information-based analysis of the Cambrian explosion will support the claim of recent authors such as Muller and Newman that the mechanism of selection and genetic mutation does not constitute an adequate causal explanation of the origination of biological form in the higher taxonomic groups. It will also suggest the need to explore other possible causal factors for the origin of form and information during the evolution of life and will examine some other possibilities that have been proposed.

The Cambrian Explosion

The “Cambrian explosion” refers to the geologically sudden appearance of many new animal body plans about 530 million years ago. At this time, at least nineteen, and perhaps as many as thirty-five phyla of forty total (Meyer et al. 2003), made their first appearance on earth within a narrow five- to ten-million-year window of geologic time (Bowring et al. 1993, 1998a:1, 1998b:40; Kerr 1993; Monastersky 1993; Aris-Brosou & Yang 2003). Many new subphyla, between 32 and 48 of 56 total (Meyer et al. 2003), and classes of animals also arose at this time with representatives of these new higher taxa manifesting significant morphological innovations. The Cambrian explosion thus marked a major episode of morphogenesis in which many new and disparate organismal forms arose in a geologically brief period of time.

To say that the fauna of the Cambrian period appeared in a geologically sudden manner also implies the absence of clear transitional intermediate forms connecting Cambrian animals with simpler pre-Cambrian forms. And, indeed, in almost all cases, the Cambrian animals have no clear morphological antecedents in earlier Vendian or Precambrian fauna (Miklos 1993, Erwin et al. 1997:132, Steiner & Reitner 2001, Conway Morris 2003b:510, Valentine et al. 2003:519-520). Further, several recent discoveries and analyses suggest that these morphological gaps may not be merely an artifact of incomplete sampling of the fossil record (Foote 1997, Foote et al. 1999, Benton & Ayala 2003, Meyer et al. 2003), suggesting that the fossil record is at least approximately reliable (Conway Morris 2003b:505).

As a result, debate now exists about the extent to which this pattern of evidence comports with a strictly monophyletic view of evolution (Conway Morris 1998a, 2003a, 2003b:510; Willmer 1990, 2003). Further, among those who accept a monophyletic view of the history of life, debate exists about whether to privilege fossil or molecular data and analyses. Those who think the fossil data provide a more reliable picture of the origin of the Metazoa tend to think these animals arose relatively quickly, that is, that the Cambrian explosion had a “short fuse” (Conway Morris 2003b:505-506, Valentine & Jablonski 2003). Some (Wray et al. 1996), but not all (Ayala et al. 1998), who think that molecular phylogenies establish reliable divergence times from pre-Cambrian ancestors think that the Cambrian animals evolved over a very long period of time, that is, that the Cambrian explosion had a “long fuse.” This review will not address these questions of historical pattern. Instead, it will analyze whether the neo-Darwinian process of mutation and selection, or other processes of evolutionary change, can generate the form and information necessary to produce the animals that arise in the Cambrian. This analysis will, for the most part,2 therefore, not depend upon assumptions of either a long or short fuse for the Cambrian explosion, or upon a monophyletic or polyphyletic view of the early history of life.

Defining Biological Form and Information

Form, like life itself, is easy to recognize but often hard to define precisely. Yet, a reasonable working definition of form will suffice for our present purposes. Form can be defined as the four-dimensional topological relations of anatomical parts. This means that one can understand form as a unified arrangement of body parts or material components in a distinct shape or pattern (topology) — one that exists in three spatial dimensions and which arises in time during ontogeny.

Insofar as any particular biological form constitutes something like a distinct arrangement of constituent body parts, form can be seen as arising from constraints that limit the possible arrangements of matter. Specifically, organismal form arises (both in phylogeny and ontogeny) as possible arrangements of material parts are constrained to establish a specific or particular arrangement with an identifiable three dimensional topography — one that we would recognize as a particular protein, cell type, organ, body plan or organism. A particular “form,” therefore, represents a highly specific and constrained arrangement of material components (among a much larger set of possible arrangements).

Understanding form in this way suggests a connection to the notion of information in its most theoretically general sense. When Shannon (1948) first developed a mathematical theory of information, he equated the amount of information transmitted with the amount of uncertainty reduced or eliminated in a series of symbols or characters. Information, in Shannon’s theory, is thus imparted as some options are excluded and others are actualized. The greater the number of options excluded, the greater the amount of information conveyed. Further, constraining a set of possible material arrangements by whatever process or means involves excluding some options and actualizing others. Thus, to constrain a set of possible material states is to generate information in Shannon’s sense. It follows that the constraints that produce biological form also impart information. Or conversely, one might say that producing organismal form by definition requires the generation of information.

In classical Shannon information theory, the amount of information in a system is also inversely related to the probability of the arrangement of constituents in a system or the characters along a communication channel (Shannon 1948). The more improbable (or complex) the arrangement, the more Shannon information, or information-carrying capacity, a string or system possesses.

Since the 1960s, mathematical biologists have realized that Shannon’s theory could be applied to the analysis of DNA and proteins to measure the information-carrying capacity of these macromolecules. Since DNA contains the assembly instructions for building proteins, the information-processing system in the cell represents a kind of communication channel (Yockey 1992:110). Further, DNA conveys information via specifically arranged sequences of nucleotide bases. Since each of the four bases has a roughly equal chance of occurring at each site along the spine of the DNA molecule, biologists can calculate the probability, and thus the information-carrying capacity, of any particular sequence n bases long.
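
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming (as the paragraph does) that each of the four bases is equally likely at every site. The function names are hypothetical and chosen only for this example.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Shannon information, in bits, of an outcome with the given probability."""
    return -math.log2(probability)


def carrying_capacity_bits(length: int, alphabet_size: int = 4) -> float:
    """Information-carrying capacity of a sequence of `length` symbols.

    Assumes every symbol of the alphabet is equally probable at each site,
    so each site contributes log2(alphabet_size) bits.
    """
    return length * math.log2(alphabet_size)


if __name__ == "__main__":
    # One base out of four equally likely options carries 2 bits.
    print(surprisal_bits(1 / 4))
    # Under the equal-probability assumption, 1,000 bases can carry 2,000 bits.
    print(carrying_capacity_bits(1000))
```

This measures carrying capacity only. It says nothing about whether a given sequence is functional, which is the distinction the following paragraphs draw.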

The ease with which information theory applies to molecular biology has created confusion about the type of information that DNA and proteins possess. Sequences of nucleotide bases in DNA, or amino acids in a protein, are highly improbable and thus have large information-carrying capacities. But, like meaningful sentences or lines of computer code, genes and proteins are also specified with respect to function. Just as the meaning of a sentence depends upon the specific arrangement of the letters in a sentence, so too does the function of a gene sequence depend upon the specific arrangement of the nucleotide bases in a gene. Thus, molecular biologists beginning with Crick equated information not only with complexity but also with “specificity,” where “specificity” or “specified” has meant “necessary to function” (Crick 1958:144, 153; Sarkar, 1996:191).3 Molecular biologists such as Monod and Crick understood biological information — the information stored in DNA and proteins–as something more than mere complexity (or improbability). Their notion of information associated both biochemical contingency and combinatorial complexity with DNA sequences (allowing DNA’s carrying capacity to be calculated), but it also affirmed that sequences of nucleotides and amino acids in functioning macromolecules possessed a high degree of specificity relative to the maintenance of cellular function.

The ease with which information theory applies to molecular biology has also created confusion about the location of information in organisms. Perhaps because the information carrying capacity of the gene could be so easily measured, it has been easy to treat DNA, RNA and proteins as the sole repositories of biological information. Neo-Darwinists in particular have assumed that the origination of biological form could be explained by recourse to processes of genetic variation and mutation alone (Levinton 1988:485). Yet if one understands organismal form as resulting from constraints on the possible arrangements of matter at many levels in the biological hierarchy–from genes and proteins to cell types and tissues to organs and body plans–then clearly biological organisms exhibit many levels of information-rich structure.

Thus, we can pose a question, not only about the origin of genetic information, but also about the origin of the information necessary to generate form and structure at levels higher than that present in individual proteins. We must also ask about the origin of the “specified complexity,” as opposed to mere complexity, that characterizes the new genes, proteins, cell types and body plans that arose in the Cambrian explosion. Dembski (2002) has used the term “complex specified information” (CSI) as a synonym for “specified complexity” to help distinguish functional biological information from mere Shannon information–that is, specified complexity from mere complexity. This review will use this term as well.

The Cambrian Information Explosion

The Cambrian explosion represents a remarkable jump in the specified complexity or “complex specified information” (CSI) of the biological world. For over three billion years, the biological realm included little more than bacteria and algae (Brocks et al. 1999). Then, beginning about 570-565 million years ago (mya), the first complex multicellular organisms appeared in the rock strata, including sponges, cnidarians, and the peculiar Ediacaran biota (Grotzinger et al. 1995). Forty million years later, the Cambrian explosion occurred (Bowring et al. 1993). The emergence of the Ediacaran biota (570 mya), and then to a much greater extent the Cambrian explosion (530 mya), represented steep climbs up the biological complexity gradient.

One way to estimate the amount of new CSI that appeared with the Cambrian animals is to count the number of new cell types that emerged with them (Valentine 1995:91-93). Studies of modern animals suggest that the sponges that appeared in the late Precambrian, for example, would have required five cell types, whereas the more complex animals that appeared in the Cambrian (e.g., arthropods) would have required fifty or more cell types. Functionally more complex animals require more cell types to perform their more diverse functions. New cell types require many new and specialized proteins. New proteins, in turn, require new genetic information. Thus an increase in the number of cell types implies (at a minimum) a considerable increase in the amount of specified genetic information. Molecular biologists have recently estimated that a minimally complex single-celled organism would require between 318 and 562 kilobase pairs of DNA to produce the proteins necessary to maintain life (Koonin 2000). More complex single cells might require upward of a million base pairs. Yet to build the proteins necessary to sustain a complex arthropod such as a trilobite would require orders of magnitude more coding instructions. The genome size of a modern arthropod, the fruitfly Drosophila melanogaster, is approximately 180 million base pairs (Gerhart & Kirschner 1997:121, Adams et al. 2000). Transitions from a single cell to colonies of cells to complex animals represent significant (and, in principle, measurable) increases in CSI.

Building a new animal from a single-celled organism requires a vast amount of new genetic information. It also requires a way of arranging gene products–proteins–into higher levels of organization. New proteins are required to service new cell types. But new proteins must be organized into new systems within the cell; new cell types must be organized into new tissues, organs, and body parts. These, in turn, must be organized to form body plans. New animals, therefore, embody hierarchically organized systems of lower-level parts within a functional whole. Such hierarchical organization itself represents a type of information, since body plans comprise both highly improbable and functionally specified arrangements of lower-level parts. The specified complexity of new body plans requires explanation in any account of the Cambrian explosion.

Can neo-Darwinism explain the discontinuous increase in CSI that appears in the Cambrian explosion–either in the form of new genetic information or in the form of hierarchically organized systems of parts? We will now examine the two parts of this question.

Novel Genes and Proteins

Many scientists and mathematicians have questioned the ability of mutation and selection to generate information in the form of novel genes and proteins. Such skepticism often derives from consideration of the extreme improbability (and specificity) of functional genes and proteins.

A typical gene contains over one thousand precisely arranged bases. For any specific arrangement of four nucleotide bases of length n, there is a corresponding number of possible arrangements of bases, 4^n. For any protein of n residues, there are 20^n possible arrangements of protein-forming amino acids. A gene 999 bases in length represents one of 4^999 possible nucleotide sequences; a protein of 333 amino acids is one of 20^333 possibilities.
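
To give a feel for the magnitudes in the paragraph above, the following sketch counts the sequences exactly using Python's arbitrary-precision integers. It is an illustration only; the helper names are hypothetical and not taken from the article.

```python
import math


def sequence_space_size(alphabet_size: int, length: int) -> int:
    """Exact number of distinct sequences of `length` symbols."""
    return alphabet_size ** length


def log10_sequence_space(alphabet_size: int, length: int) -> float:
    """Base-10 logarithm of the sequence-space size (its order of magnitude)."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    gene_space = sequence_space_size(4, 999)       # 999-base gene
    protein_space = sequence_space_size(20, 333)   # 333-residue protein
    # Number of decimal digits in each count.
    print(len(str(gene_space)), len(str(protein_space)))
    # Orders of magnitude (exponent of 10).
    print(log10_sequence_space(4, 999), log10_sequence_space(20, 333))
```

In decimal terms, 4^999 is roughly 10^601 and 20^333 is roughly 10^433.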

Since the 1960s, some biologists have thought functional proteins to be rare among the set of possible amino acid sequences. Some have used an analogy with human language to illustrate why this should be the case. Denton (1986:309-311), for example, has shown that meaningful words and sentences are extremely rare among the set of possible combinations of English letters, especially as sequence length grows. (The ratio of meaningful 12-letter words to 12-letter sequences is 1/10^14; the ratio of 100-letter sentences to possible 100-letter strings is 1/10^100.) Further, Denton shows that most meaningful sentences are highly isolated from one another in the space of possible combinations, so that random substitutions of letters will, after a very few changes, inevitably degrade meaning. Apart from a few closely clustered sentences accessible by random substitution, the overwhelming majority of meaningful sentences lie, probabilistically speaking, beyond the reach of random search.

Denton (1986:301-324) and others have argued that similar constraints apply to genes and proteins. They have questioned whether an undirected search via mutation and selection would have a reasonable chance of locating new islands of function–representing fundamentally new genes or proteins–within the time available (Eden 1967, Shutzenberger 1967, Lovtrup 1979). Some have also argued that alterations in sequencing would likely result in loss of protein function before fundamentally new function could arise (Eden 1967, Denton 1986). Nevertheless, neither the extent to which genes and proteins are sensitive to functional loss as a result of sequence change, nor the extent to which functional proteins are isolated within sequence space, has been fully known.

Recently, experiments in molecular biology have shed light on these questions. A variety of mutagenesis techniques have shown that proteins (and thus the genes that produce them) are indeed highly specified relative to biological function (Bowie & Sauer 1989, Reidhaar-Olson & Sauer 1990, Taylor et al. 2001). Mutagenesis research tests the sensitivity of proteins (and, by implication, DNA) to functional loss as a result of alterations in sequencing. Studies of proteins have long shown that amino acid residues at many active positions cannot vary without functional loss (Perutz & Lehmann 1968). More recent protein studies (often using mutagenesis experiments) have shown that functional requirements place significant constraints on sequencing even at non-active site positions (Bowie & Sauer 1989, Reidhaar-Olson & Sauer 1990, Chothia et al. 1998, Axe 2000, Taylor et al. 2001). In particular, Axe (2000) has shown that multiple as opposed to single position amino acid substitutions inevitably result in loss of protein function, even when these changes occur at sites that allow variation when altered in isolation. Cumulatively, these constraints imply that proteins are highly sensitive to functional loss as a result of alterations in sequencing, and that functional proteins represent highly isolated and improbable arrangements of amino acids, arrangements that are far more improbable, in fact, than would be likely to arise by chance alone in the time available (Reidhaar-Olson & Sauer 1990; Behe 1992; Kauffman 1995:44; Dembski 1998:175-223; Axe 2000, 2004). (See below the discussion of the neutral theory of evolution for a precise quantitative assessment.)

Of course, neo-Darwinists do not envision a completely random search through the set of all possible nucleotide sequences–so-called “sequence space.” They envision natural selection acting to preserve small advantageous variations in genetic sequences and their corresponding protein products. Dawkins (1996), for example, likens an organism to a high mountain peak. He compares climbing the sheer precipice up the front side of the mountain to building a new organism by chance. He acknowledges that his approach up “Mount Improbable” will not succeed. Nevertheless, he suggests that there is a gradual slope up the backside of the mountain that could be climbed in small incremental steps. In his analogy, the backside climb up “Mount Improbable” corresponds to the process of natural selection acting on random changes in the genetic text. What chance alone cannot accomplish blindly or in one leap, selection (acting on mutations) can accomplish through the cumulative effect of many slight successive steps.

Yet the extreme specificity and complexity of proteins presents a difficulty, not only for the chance origin of specified biological information (i.e., for random mutations acting alone), but also for selection and mutation acting in concert. Indeed, mutagenesis experiments cast doubt on each of the two scenarios by which neo-Darwinists envisioned new information arising from the mutation/selection mechanism (for review, see Lonnig 2001). For neo-Darwinism, new functional genes either arise from non-coding sections in the genome or from preexisting genes. Both scenarios are problematic.

In the first scenario, neo-Darwinists envision new genetic information arising from those sections of the genetic text that can presumably vary freely without consequence to the organism. According to this scenario, non-coding sections of the genome, or duplicated sections of coding regions, can experience a protracted period of “neutral evolution” (Kimura 1983) during which alterations in nucleotide sequences have no discernible effect on the function of the organism. Eventually, however, a new gene sequence will arise that can code for a novel protein. At that point, natural selection can favor the new gene and its functional protein product, thus securing the preservation and heritability of both.

This scenario has the advantage of allowing the genome to vary through many generations, as mutations “search” the space of possible base sequences. The scenario has an overriding problem, however: the size of the combinatorial space (i.e., the number of possible amino acid sequences) and the extreme rarity and isolation of the functional sequences within that space of possibilities. Since natural selection can do nothing to help generate new functional sequences, but rather can only preserve such sequences once they have arisen, chance alone–random variation–must do the work of information generation–that is, of finding the exceedingly rare functional sequences within the set of combinatorial possibilities. Yet the probability of randomly assembling (or “finding,” in the previous sense) a functional sequence is extremely small.

Cassette mutagenesis experiments performed during the early 1990s suggest that the probability of attaining (at random) the correct sequencing for a short protein 100 amino acids long is about 1 in 10^65 (Reidhaar-Olson & Sauer 1990, Behe 1992:65-69). This result agreed closely with earlier calculations that Yockey (1978) had performed based upon the known sequence variability of cytochrome c in different species and other theoretical considerations. More recent mutagenesis research has provided additional support for the conclusion that functional proteins are exceedingly rare among possible amino acid sequences (Axe 2000, 2004). Axe (2004) has performed site-directed mutagenesis experiments on a 150-residue protein-folding domain within a β-lactamase enzyme. His experimental method improves upon earlier mutagenesis techniques and corrects for several sources of possible estimation error inherent in them. On the basis of these experiments, Axe has estimated the ratio of (a) proteins of typical size (150 residues) that perform a specified function via any folded structure to (b) the whole set of possible amino acid sequences of that size. Axe has estimated this ratio to be 1 to 10^77. Thus, the probability of finding a functional protein among the possible amino acid sequences corresponding to a 150-residue protein is similarly 1 in 10^77.
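
The odds quoted above can be put on the same scale as the Shannon measure introduced earlier by converting powers of ten into bits. The sketch below does only that arithmetic, plus the size of the 150-residue sequence space for comparison. The function names are hypothetical, and the last quantity is simple arithmetic on the quoted figures (assuming twenty equally available amino acids), not a result reported in the cited studies.

```python
import math


def odds_to_bits(exponent10: float) -> float:
    """Bits corresponding to odds of 1 in 10**exponent10."""
    return exponent10 * math.log2(10)


def log10_protein_space(length: int, alphabet_size: int = 20) -> float:
    """Order of magnitude of the number of sequences of `length` residues."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    # Odds quoted in the text: 1 in 10^65 (100 residues), 1 in 10^77 (150 residues).
    print(odds_to_bits(65), odds_to_bits(77))
    # Size of the 150-residue sequence space, as a power of ten.
    space = log10_protein_space(150)
    print(space)
    # Illustrative arithmetic only: a ratio of 1 in 10^77 applied to that space.
    print(space - 77)
```

On these figures, odds of 1 in 10^77 correspond to roughly 256 bits, and the 150-residue sequence space is on the order of 10^195.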

Other considerations imply additional improbabilities. First, new Cambrian animals would require proteins much longer than 100 residues to perform many necessary specialized functions. Ohno (1996) has noted that Cambrian animals would have required complex proteins such as lysyl oxidase in order to support their stout body structures. Lysyl oxidase molecules in extant organisms comprise over 400 amino acids. These molecules are both highly complex (non-repetitive) and functionally specified. Reasonable extrapolation from mutagenesis experiments done on shorter protein molecules suggests that the probability of producing functionally sequenced proteins of this length at random is so small as to make appeals to chance absurd, even granting the duration of the entire universe. (See Dembski 1998:175-223 for a rigorous calculation of this “Universal Probability Bound”; see also Axe 2004.) Yet, second, fossil data (Bowring et al. 1993, 1998a:1, 1998b:40; Kerr 1993; Monastersky 1993), and even molecular analyses supporting deep divergence (Wray et al. 1996), suggest that the duration of the Cambrian explosion (between 5-10 x 10^6 and, at most, 7 x 10^7 years) is far smaller than that of the entire universe (1.3-2 x 10^10 years). Third, DNA mutation rates are far too low to generate the novel genes and proteins necessary to building the Cambrian animals, given the most probable duration of the explosion as determined by fossil studies (Conway Morris 1998b). As Ohno (1996:8475) not