John Huelsenbeck
Professor of Biology, UCSD

e-mail: johnh@biomail.ucsd.edu
lab homepage: http://brahms.ucsd.edu
 

         In The Origin of Species, Darwin founded evolutionary biology on two
ideas: (1) all species are related to one another through a history of
common descent and (2) the exquisite match between a species and its
environment is explained with natural selection, a process in which
individuals with beneficial mutations leave more offspring. Much of
the work in evolutionary biology over the past century has elaborated
on these ideas. For example, population geneticists developed mathematical
models to describe the behavior of mutations in populations. Another
group of scientists (mostly museum researchers) developed the field of
phylogenetics to infer the evolutionary history of life. Both population
genetics and phylogenetics have been radically transformed by the recent
flood of molecular data. Molecular biology provided observations that
challenged population genetic theory (e.g., in the 1960s, researchers
found more variability at the molecular level than expected, a result that
stimulated Kimura's neutral theory of molecular evolution). Similarly, molecular
biology has provided the genetic data needed to make inferences about
the genealogy of distantly related species. To a large extent, our research
interests are motivated by the challenges that molecular data pose to
phylogenetics and population genetics.

        The Phylogeny Problem. Evolutionary biology is founded on the concept that
organisms share a common origin and have subsequently diverged through
time. Phylogenies represent our attempts to reconstruct those evolutionary
histories, and there is probably more interest in phylogenetic reconstruction
today than at any time in the past. Phylogenies are central to virtually
all comparisons among species, and they have found practical uses in
tracing routes of infectious disease transmission (e.g., dental transmission
of AIDS/HIV) and in identifying new pathogens such as the New Mexico hantavirus.

        The phylogeny problem--the estimation of the genealogy of organisms from DNA
sequences--is not a standard statistical one. Hence one cannot simply consult
statistical texts for a solution. Our research concentrates on how phylogeny
can be estimated and how phylogenies can be used to address questions in
evolutionary biology. In general, we have taken a Bayesian approach to the
inference of phylogeny. Bayesian inference is a widely used method for
making statistical inferences but has found only limited use in evolutionary
biology. The technique we use to perform Bayesian analysis of DNA sequences
is Markov chain Monte Carlo (MCMC). MCMC takes valid, albeit dependent,
samples from the probability distribution of interest and has made Bayesian
inference practical for many scientific problems. Here we outline a few of
the phylogenetic questions that we are interested in.
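
        As a toy illustration of the idea, the sketch below runs a Metropolis
sampler (one simple form of MCMC) over the three rooted trees for three species.
The posterior weights, the function names, and the seed are assumptions made up
for this example; a real analysis would compute each weight as likelihood times
prior from the sequence data, and the sketch is not meant to describe any
particular program.

```python
import random
from collections import Counter

# Hypothetical unnormalized posterior weights for the three rooted trees of
# three species. These values are invented for illustration only.
UNNORMALIZED_POSTERIOR = {"(A,(B,C))": 6.0, "(B,(A,C))": 3.0, "(C,(A,B))": 1.0}


def metropolis_sample_trees(weights, n_steps, seed=0):
    """Return a list of n_steps dependent samples drawn by a Metropolis chain."""
    rng = random.Random(seed)
    trees = list(weights)
    current = trees[0]
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: choose one of the other trees uniformly.
        proposal = rng.choice([t for t in trees if t != current])
        # Accept with probability min(1, posterior ratio).
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
        # The current state is recorded whether or not the move was accepted.
        samples.append(current)
    return samples


def sample_frequencies(samples):
    """Fraction of the chain spent on each tree that was visited."""
    counts = Counter(samples)
    return {tree: counts[tree] / len(samples) for tree in counts}
```

Given enough steps, the fraction of the chain spent on each tree approaches
that tree's normalized weight (0.6, 0.3, and 0.1 for the values assumed here),
even though successive samples are dependent.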

        Estimating large phylogenies. There are only three possible trees that could
represent the phylogenetic history of three species: (A,(B,C)); (B,(A,C));
and (C,(A,B)). Even a method that picks one of the trees at random, then,
has a reasonable chance of correctly inferring phylogenetic history. However,
for a "small" phylogenetic problem involving 10 species, there are 34,459,425
possible rooted trees, and for a problem of only 22 species, there are more than
a mole of trees (over 6.02 x 10^23). Today, most phylogenetic problems involve
over 80 species and there are
some data sets that have over 500 species. (For 500 species, there are
approximately 1.0085 x 10^1280 possible trees, only one of which can be
correct.) The analysis of phylogenetic problems involving hundreds of sequences
poses enormous computational problems.
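
        The counts quoted above are consistent with the standard formula for the
number of rooted, bifurcating, labeled trees on n species, (2n-3)!! = 3 x 5 x ...
x (2n-3). The short sketch below evaluates that product with exact integer
arithmetic; the function name is our own choice for illustration.

```python
def num_rooted_trees(n_species: int) -> int:
    """Number of rooted, bifurcating, labeled trees: (2n-3)!! for n >= 2."""
    if n_species < 2:
        raise ValueError("need at least two species")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product when n == 2).
    for k in range(3, 2 * n_species - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (3, 10, 22, 500):
        trees = num_rooted_trees(n)
        # Report the number of decimal digits rather than the full integer.
        print(n, "species:", len(str(trees)), "digits")
```

For 3 and 10 species the function returns 3 and 34,459,425, matching the figures
in the text.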

        Most of the methods for tackling such large problems have serious deficiencies.
The optimality criteria used by these methods often have dubious statistical
justifications. Also, many of the methods are simply stepwise addition algorithms
and make no effort to explore the space of trees. However, the methods having the
best statistical justification, such as maximum likelihood and Bayesian inference,
are also the most difficult to implement for large problems. We are using Bayesian
inference with MCMC to infer large phylogenies. There are several advantages to
such an approach. For one, the optimality criterion uses all the information
present in the data and the method provides the posterior probability of trees.
Also, some variants of MCMC can allow better exploration of the space of trees.

        Comparative analysis. The comparative method in evolutionary biology involves
comparing one or more features across species. The comparative method has
provided much of the evidence for natural selection and is probably the most
widely used statistical method in evolutionary biology. Since the mid-1980s,
it has been realized that phylogeny must be accommodated in comparative analyses;
failure to take account of the similarity in features across species that is
caused by a common history can seriously bias comparative analyses, rendering
them meaningless. Hence, the gold standard for a comparative analysis today
includes the phylogenetic history of the species. These methods, however, all
suffer from one serious problem: they assume that the phylogeny is known without
error. Yet almost all phylogenies have a large degree of uncertainty. How can
comparative analyses be performed that accommodate phylogenetic history but do
not depend upon any single phylogeny being correct?

        One potential solution is to perform analyses while summing over all possible
phylogenetic trees. Inferences are weighted by the probability that each tree
is correct. Although it is impossible to evaluate all possible trees for even
moderately large problems, MCMC methods can sample phylogenies according to
their posterior probabilities.
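
        A minimal sketch of this weighting is given below. It assumes that a list
of trees sampled by MCMC is already available and that some comparative statistic
can be computed conditional on any one tree; the function and argument names are
hypothetical. Because each tree appears in the sample in proportion to its
posterior probability, weighting each distinct tree by its sample frequency
approximates the sum over all trees.

```python
from collections import Counter


def tree_averaged_estimate(sampled_trees, statistic_given_tree):
    """Average a per-tree statistic over MCMC-sampled trees.

    sampled_trees: list of hashable tree labels, one per MCMC sample.
    statistic_given_tree: callable returning the statistic for one tree.
    """
    if not sampled_trees:
        raise ValueError("need at least one sampled tree")
    counts = Counter(sampled_trees)
    total = len(sampled_trees)
    # Each distinct tree is evaluated once and weighted by its sample
    # frequency, which estimates its posterior probability.
    return sum(
        (n / total) * statistic_given_tree(tree) for tree, n in counts.items()
    )
```

For instance, if three trees are sampled 6, 3, and 1 times out of 10 and the
statistic on them is 0.5, 0.2, and -0.1 (values invented for illustration), the
tree-averaged estimate is 0.6 x 0.5 + 0.3 x 0.2 + 0.1 x (-0.1) = 0.35.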

        The Genetics of Adaptation. Although natural selection has been repeatedly
detected in the evolution of morphological traits (such as the beak of Darwin's
finches) and although the footprint of natural selection is evident in virtually
all genes that have been sequenced to date, the genetic basis of most adaptations
remains unknown. The reason is that the process of adaptive change remains
difficult to study directly. We have been examining the genetics of adaptation
using an experimental system of ssRNA bacteriophage of the family Leviviridae.
This family includes several of the best-studied phage, such as Qβ and MS2,
and has several features, such as a small genome (3600 to 4200 nt with four genes)
and high mutation rate (1.5 x 10^-3 mutations/replication), that render it ideal
for studying the dynamics of adaptation at the molecular level.

        The central idea is to adapt phage to novel environments and to detect all of
the genetic changes that occur during the course of adaptation to the new
environment. Specifically, we are adapting phage to growth at high temperature
(43 deg. C), to growth in a novel host (Salmonella instead of E. coli), and to
escape antibodies made to the whole phage. The important point is that the
number and effects of the mutations that occur can be assayed directly: the
complete genome can be sequenced, and the effect of each mutation can be measured
as the phage growth rate. Moreover, because we can store all of the ancestral
phage populations in the freezer, we can examine the dynamics of the adaptive
changes. We can thus address several outstanding questions in evolutionary biology
about the genetics of adaptation: (1) How many mutations are involved in adaptive
change? (2) What is the distribution of their fitness effects during a bout
of adaptation? For the second question, theory specifically predicts that the
distribution should be nearly exponential. There are also several questions
specific to the phage system that can be addressed: (3) Do compensatory mutations
occur in stem regions of RNA? (4) When the phage are adapted to growth at high
temperature, are the beneficial mutations those that increase the stability of
the RNA molecule? (5) Are phage constrained by their phylogenetic history?
For the high temperature experiment, for example, we are adapting eight different
species that represent the range of Leviviridae to grow well at high temperature.
For each species, the experiment is repeated three times (with one control line
at 37 deg. C). We can ask if the adaptive changes are repeatable within species
and also if they are repeatable between species. Because the phylogeny of the
group is known, we can examine the extent to which phylogeny predicts adaptive
sites that are shared across species. The first (preliminary) high temperature
experiment on Qβ indicates that a small number of mutations (10) are responsible
for most of the fitness increase, all of which were to G or C and all of which
increased the thermodynamic stability of the RNA molecule at high temperature.