John Huelsenbeck
e-mail: johnh@biomail.ucsd.edu
In
The Origin of Species, Darwin founded evolutionary biology on two
ideas: (1) all species are related to one another through a history
of
common descent and (2) the exquisite match between a species and
its
environment is explained with natural selection, a process in which
individuals with beneficial mutations leave more offspring. Much
of
the work in evolutionary biology over the past century has elaborated
on these ideas. For example, population geneticists developed mathematical
models to describe the behavior of mutations in populations. Another
group of scientists (mostly museum researchers) developed the field
of
phylogenetics to infer the evolutionary history of life. Both population
genetics and phylogenetics have been radically transformed by the
recent
flood of molecular data. Molecular biology provided observations
that
challenged population genetic theory (e.g., in the 1960s, researchers
found more variability at the molecular level than expected, a finding
that stimulated
Kimura's neutral theory of molecular evolution). Similarly, molecular
biology has provided the genetic data needed to make inferences
about
the genealogy of distantly related species. To a large extent, our
research
interests are motivated by the challenges that molecular data pose
to
phylogenetics and population genetics.
The Phylogeny Problem.
Evolutionary biology is founded on the concept that
organisms share a common origin and have subsequently diverged through
time. Phylogenies represent our attempts to reconstruct those evolutionary
histories, and there is probably more interest in phylogenetic reconstruction
today than at any time in the past. Phylogenies are central to virtually
all comparisons among species, and they have found practical uses
in
tracing routes of infectious disease transmission (e.g., the
transmission
of HIV in a dental practice) and in identifying new pathogens such as the New Mexico
hantavirus.
The phylogeny problem--the
estimation of the genealogy of organisms from DNA
sequences--is not a standard statistical one. Hence one cannot simply
consult
statistical texts for a solution. Our research concentrates on how
phylogeny
can be estimated and how phylogenies can be used to address questions
in
evolutionary biology. In general, we have taken a Bayesian approach
to the
inference of phylogeny. Bayesian inference is a widely used method
for
making statistical inferences but has found only limited use in
evolutionary
biology. The technique we use to perform Bayesian analysis of DNA
sequences
is Markov chain Monte Carlo (MCMC). MCMC takes valid, albeit dependent,
samples from the probability distribution of interest and has made
Bayesian
inference practical for many scientific problems. Here we outline
a few of
the phylogenetic questions that we are interested in.
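As a minimal sketch of how MCMC draws dependent samples from a probability distribution, the Python below runs a Metropolis sampler over the three rooted trees for species A, B, and C. The weights are hypothetical unnormalized posterior values chosen only for illustration. In a real analysis they would come from the likelihood of the DNA sequences and the prior, and proposals would perturb the current tree rather than pick a tree uniformly at random. The names WEIGHTS, metropolis, and estimate_posterior are assumptions of this sketch, not part of any phylogenetics package.

```python
import random
from collections import Counter

# Hypothetical unnormalized posterior weights for the three rooted trees
# of species A, B, and C. Illustrative numbers only, not from real data.
WEIGHTS = {"(A,(B,C))": 6.0, "(B,(A,C))": 3.0, "(C,(A,B))": 1.0}


def metropolis(weights, n_steps, seed=0):
    """Return a chain of n_steps states sampled by the Metropolis algorithm."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: pick any state uniformly at random.
        proposal = rng.choice(states)
        # Accept with probability min(1, w(proposal) / w(current)).
        ratio = weights[proposal] / weights[current]
        if rng.random() < min(1.0, ratio):
            current = proposal
        # The chain records the current state whether or not the move was
        # accepted, which is why successive samples are dependent.
        samples.append(current)
    return samples


def estimate_posterior(samples):
    """Estimate each state's posterior probability by its sample frequency."""
    counts = Counter(samples)
    total = len(samples)
    return {state: count / total for state, count in counts.items()}


if __name__ == "__main__":
    chain = metropolis(WEIGHTS, n_steps=100_000, seed=1)
    for tree, prob in sorted(estimate_posterior(chain).items()):
        print(tree, round(prob, 3))
```

With these weights the sample frequencies should approach 0.6, 0.3, and 0.1 as the chain grows, because only ratios of weights enter the acceptance step and the normalizing constant is never needed.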
Estimating large phylogenies.
There are only three possible trees that could
represent the phylogenetic history of three species: (A,(B,C));
(B,(A,C));
and (C,(A,B)). Even a method that picks one of the trees at random,
then,
has a reasonable chance of correctly inferring phylogenetic history.
However,
for a "small" phylogenetic problem involving 10 species,
there are 34,459,425
possible trees, and for a problem of only 22 species, there are more than
a mole of
trees. Today, most phylogenetic problems involve over 80 species,
and there are
some data sets that have over 500 species. (For 500 species, there
are
approximately 1.0085 × 10^1280 possible trees, only one
of which can be
correct.) The analysis of phylogenetic problems involving hundreds
of sequences
poses enormous computational problems.
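The counts quoted above are consistent with the standard formula for the number of rooted, bifurcating, labeled trees on n species, (2n-3)!! = 1 × 3 × 5 × ... × (2n-3). The short Python sketch below evaluates that product with exact integer arithmetic. The function name num_rooted_trees and the constant AVOGADRO are assumptions of this sketch.

```python
AVOGADRO = 6.02214076e23  # number of items in one mole


def num_rooted_trees(n_species):
    """Number of rooted, bifurcating, labeled trees: (2n-3)!! for n >= 2."""
    if n_species < 1:
        raise ValueError("need at least one species")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for odd in range(3, 2 * n_species - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    print(num_rooted_trees(3))    # 3
    print(num_rooted_trees(10))   # 34459425
    print(num_rooted_trees(22) > AVOGADRO)
    # Python integers are arbitrary precision, so the 500-species count
    # can be computed exactly; here we report only its number of digits.
    print(len(str(num_rooted_trees(500))))
```

Because the count grows faster than exponentially with the number of species, exhaustively scoring every tree is out of reach well before data sets reach a few dozen sequences.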
Most of the methods for
tackling such large problems have serious deficiencies.
The optimality criteria used by these methods often have dubious
statistical
justifications. Also many of the methods are simply step-wise addition
algorithms
and make no effort to explore the space of trees. However, the methods
having the
best statistical justification, such as maximum likelihood and Bayesian
inference,
are also the most difficult to implement for large problems. We
are using MCMC-based Bayesian
inference to infer large phylogenies. There are several
advantages of
such an approach. For one, the optimality criterion uses all the
information
present in the data and the method provides the posterior probability
of trees.
Also, some variants of MCMC can allow better exploration of the
space of trees.
Comparative analysis.
The comparative method in evolutionary biology involves
comparing one or more features across species. The comparative method
has
provided much of the evidence for natural selection and is probably
the most
widely used statistical method in evolutionary biology. Since the
mid-1980s,
it has been realized that phylogeny must be accommodated in comparative
analyses;
failure to take account of the similarity in features across species
that is
caused by a common history can seriously bias comparative analyses,
rendering
them meaningless. Hence, the gold standard for a comparative analysis
today
includes the phylogenetic history of the species. These methods,
however,
all share one serious problem: they assume that the phylogeny is
known without
error. Yet almost all phylogenies carry a large degree of uncertainty.
How can
comparative analyses be performed that accommodate phylogenetic
history but do
not depend upon any single phylogeny being correct?
One potential solution
is to perform analyses while summing over all possible
phylogenetic trees. Inferences are weighted by the probability that
each tree
is correct. Although it is impossible to evaluate all possible trees
for even
moderately large problems, MCMC methods can sample phylogenies according
to
their posterior probabilities.
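One way to picture this weighting: if an MCMC run visits each tree in proportion to its posterior probability, then averaging a tree-dependent quantity over the sampled trees approximates the posterior-weighted sum over all trees. The Python sketch below assumes trees are represented as plain strings and uses a made-up list of ten samples. The function posterior_average and the example statistic are hypothetical names for illustration, not a method taken from any particular software.

```python
from collections import Counter


def posterior_average(sampled_trees, statistic):
    """Average statistic(tree) over MCMC samples.

    Each distinct tree is weighted by its sample frequency, which
    approximates its posterior probability.
    """
    if not sampled_trees:
        raise ValueError("need at least one sampled tree")
    counts = Counter(sampled_trees)
    total = len(sampled_trees)
    return sum((count / total) * statistic(tree)
               for tree, count in counts.items())


if __name__ == "__main__":
    # Hypothetical MCMC output: illustrative samples, not real data.
    sampled = ["(A,(B,C))"] * 6 + ["(B,(A,C))"] * 3 + ["(C,(A,B))"] * 1

    # Example statistic: 1 if B and C are sister species on the tree, else 0.
    def b_and_c_are_sisters(tree):
        return 1.0 if "(B,C)" in tree else 0.0

    print(posterior_average(sampled, b_and_c_are_sisters))
```

In a comparative analysis the statistic would be the quantity of interest computed on each tree (for example, a correlation between two traits), so the final inference does not depend on any single phylogeny being correct.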
The Genetics of Adaptation.
Although natural selection has been repeatedly
detected in the evolution of morphological traits (such as the beak
of Darwin's
finches) and although the footprint of natural selection is evident
in virtually
all genes that have been sequenced to date, the genetic basis of
most adaptations
remains unknown. The reason is that the process of adaptive change
remains
difficult to study directly. We have been examining the genetics of
adaptation
using an experimental system of ssRNA bacteriophage of the family
Leviviridae.
This family includes several of the best-studied phage, such as
Qβ and MS2,
and has several features, such as a small genome (3600 to 4200 nt
with four genes)
and a high mutation rate (1.5 × 10^-3 mutations/replication),
that render it ideal
for studying the dynamics of adaptation at the molecular level.
The central idea is to
adapt phage to novel environments and to detect all of
the genetic changes that occur during the course of adaptation to
the new
environment. Specifically, we are adapting phage to growth at high
temperature
(43 °C), to growth in a novel host (Salmonella instead of E.
coli), and to
escape antibodies made to the whole phage. The important point
is this: the
number and effects of the mutations that occur can be assayed directly,
because the
complete genome can be sequenced and the effect of each mutation
can be measured
as a change in the phage growth rate. Moreover, because we can store all of
the ancestral
phage populations in the freezer, we can examine the dynamics of
the adaptive
changes. We can thus address several outstanding questions in evolutionary
biology
about the genetics of adaptation: (1) How many mutations are involved
in adaptive
change?; and (2) What is the distribution of their fitness effects
during a bout
of adaptation? For this second question theory specifically predicts
that the
distribution should be nearly exponential. There are also several
questions
specific to the phage system that can be addressed: (3) Do compensatory
mutations
occur in stem regions of RNA?; (4) When the phage are adapted to
growth at high
temperature, are the beneficial mutations those that increase the
stability of
the RNA molecule?; and (5) Are phage constrained by their phylogenetic
history?
For the high temperature experiment, for example, we are adapting
eight different
species that represent the range of Leviviridae to grow well at
high temperature.
For each species, the experiment is repeated three times (with one
control line
at 37 °C). We can ask whether the adaptive changes are repeatable
within species
and also if they are repeatable between species. Because the phylogeny
of the
group is known, we can examine the extent to which phylogeny predicts
adaptive
sites that are shared across species. The first (preliminary) high
temperature
experiment on Qβ indicates that a small number of mutations (10)
are responsible
for most of the fitness increase. All of these mutations were changes to G or C, and
all of them
increased the thermodynamic stability of the RNA molecule at high
temperature.