Background
Antigenic drift and the generation of viral
quasispecies
Some RNA viruses form a quasispecies--a set of related
viral variants that coexist in field populations and even
within single infected individuals (reviewed in [ 1, 2,
3, 4, 5]). The emergence of immunologically distinct
members of a viral quasispecies through mutation and
subsequent immune selection is called "antigenic drift."
Antigenic drift is thought to be important in human
immunodeficiency virus (HIV) infection and the continuing
seasonal influenza epidemics because immunity generated
against one viral quasispecies member selects for escape
variants. Attributed in part to antigenic drift are the
moderately high failure rate and the short-lived efficacy
of influenza vaccines [ 6], the failure of synthetic
foot-and-mouth disease virus vaccines [ 7], and the
inability of recombinant HIV vaccines to provide complete
protection against field strains of the virus [ 8].
The hemagglutinin (HA) envelope surface
glycoprotein--the major neutralizing determinant of
influenza A--is a classic example of an antigenically
drifting protein [ 9]. Walter Gerhard and colleagues
demonstrated that the immune pressure exerted by
monoclonal antibodies (Abs) selects for HA escape mutants
in model systems [ 10, 11]. Later, Dimmock and colleagues
showed that polyclonal anti-sera also select for escape
mutants [ 12, 13]. Similarly, much of the observed
variability of glycoprotein 120 (gp120), the principal
surface antigen of HIV, is thought to reflect antigenic
drift [ 14, 15, 16, 17]. The correlation of intra-patient
viral diversity with immune response strength has been
cited as evidence that the immune response is a selective
factor in HIV antigenic drift [ 18, 19, 20, 21, 22].
Phylogenetic analyses describe divergence within a
viral population, and these methods have been used to
infer the selective advantages of viral variation [ 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. A more
direct indication of the selective advantage gained
through variation is an observed overabundance of
replacement mutations relative to silent mutations in
viral proteins [ 31]. Such analyses of gp120 and its V
regions indicate that replacement mutations are generally
over-represented in this protein and thus appear to
confer selective advantage to HIV-1 [ 22, 24, 25, 26, 27,
28, 29, 30, 32, 33, 34, 35]. In more detailed analyses,
several groups tested individual codons for replacement
mutations that are, as an aggregate, overabundant [ 23,
36, 37, 38]. However, none of these methods determine
which replacement mutations are actually positively
selected. Also, when replacement mutations of varying
fitness are lumped together, positively selected
mutations may remain undetected among negatively selected
mutations.
To overcome these limitations, we have developed a
"selection mapping" algorithm. The cornerstone of
selection mapping is the testing of each observed
replacement mutation at each codon to identify those
particular replacement mutations that are overabundant
relative to silent mutations at that codon. Such
replacement mutations are determined to be positively
selected. Negatively selected variants are recognized as
"noise" and are thereafter ignored. Here, we use the
selection mapping method to identify the positively
selected variants of influenza A HA (H3 serotype), HIV-1
reverse transcriptase (RT), and HIV-1 gp120.
Results and Discussion
Selection map of influenza A H3
hemagglutinin
QUASI identifies 25 HA codons where one or more
replacement mutations are positively selected in the
influenza A H3 virus (Fig. 1). [From our neutral drift
testing of QUASI, we expect a maximum of about 2-3 false
positives (see Materials and Methods)]. The distribution
of these positively selected codons is of particular
interest. Without exception, the codons where variants
are positively selected are on the HA surface (Fig. 2,
left). The parsimonious explanation of this result is
that HA variants are primarily selected for escape from
B-cell immunity. If T-cell immunity is instead the
primary selective force affecting HA variation, then
either all T-cell immunity escape variants are
coincidentally solvent-exposed, or HA T-cell epitopes are
determined by Ab protection [ 39]. These 25 codons
include 13 outside those identified as positively
selected by Walter Fitch and colleagues [ 40]. We
attribute our new findings primarily to sites where many
variants are negatively selected but where at least one
variant is positively selected. The Fitch group also
identifies 6 additional codons as positively selected. We
believe that these are false positives caused by the
Fitch group's assumption that HA is, on average,
neutrally drifting; these six codons may be less
negatively selected than average, but they nevertheless
appear to be negatively selected. Additionally, our data,
while largely the same as those analyzed by the Fitch
group, do have some differences.
Wiley
et al . proposed four antigenic
sites where field and laboratory mutations could be
grouped on the HA surface [ 41]. These putative antigenic
sites are indicated in Figure 1and the right side of
Figure 2. Positively selected variants are correlated
(Fisher's exact test) with antigenic site A (p = 5.27 ×
10 -3), antigenic site B (p = 1.12 × 10 -3), and
antigenic site C (p = 1.0 × 10 -5). In contrast,
antigenic site D is not particularly correlated with
positively selected variants. We believe the lack of
positively selected variants spanning antigenic site D
may explain a decades-old puzzle. In the fully assembled
HA protein, site D is buried in the trimer interface and
therefore is not generally accessible to Ab [ 41, 42]. At
the time that the antigenic sites were proposed, residues
in and around the trimer interface were variable and
found on the monomer surface, so they were grouped and
labeled as antigenic site D even though it was unclear
how site D was recognized by Ab [ 41]. In fact, the only
two positively selected variants QUASI identifies in site
D are solvent-exposed at the extreme edge of the HA
trimer (Fig. 2). Based on QUASI's selection map of HA, we
now conclude that those mutations found in the
trimer-buried portion of HA do not confer significant
advantage on influenza A. That is, while the
trimer-buried portion of antigenic site D includes the
sound and fury of variability, it signifies nothing.
While QUASI finds that antigenic sites A-C are
significantly associated with positively selected
variation, this association may be a simple consequence
of the sites' surface exposure. As can be seen in the
left-hand side of Figure 2, positively selected variants
are scattered across the entire exposed surface of HA.
There are positively selected variants outside the
antigenic sites, and there are subregions of the
antigenic sites where variation is negatively selected.
Thus, it may be more appropriate to view antigenic sites
as more positional than functional. Indeed, more recent
demarcations of antigenic sites enlarge antigenic sites
A-D and add an additional antigenic site, site E, and
these sites are now considered primarily positional
rather than functional [ 6, 43].
Selection map of HIV-1 gp120
At 123 codons, QUASI's selection map of gp120
indicates that one or more non-consensus amino acids are
positively selected in HIV-1 (Fig. 3). Most positively
selected variants appear to be on the gp120 surface (not
shown), but in contrast to HA, gp120 includes several
monomer-buried positively selected variants (at sites
I225, V270, N295, H333, I345, T387, I424, and L453).
Additionally, some of the positively selected variants
that appear to be solvent-exposed may normally be buried
(some loops are absent from the core protein crystal
structure [ 44]). The burial of positively selected
variants in the gp120 monomer confirms that gp120
quasispeciation is not selected solely for escape from
B-cell immunity.
Two competition-group epitopes have been identified
for broadly neutralizing anti-gp120 Abs: the CD4-binding
site (CD4BS) and the CD4-induced (CD4i) epitopes
(references in [ 45]). Each epitope includes only a
single non-consensus positively selected variant (Fig.
3). Thus, broadly neutralizing Abs appear to be those
that engage few protein positions where variation is
positively selected. Based on this observation, we
propose that the neutralizing spectra of Abs may be
predicted if the epitopes are known. For example, we
would predict that the anti-CD4BS Abs will have
particular trouble recognizing gp120 molecules carrying
the positively selected variant of the CD4BS epitope (D→N
at codon 474). This prediction is fulfilled in that the
15e anti-CD4BS monoclonal Ab fails to react to gp120 from
HIV strain RF, a strain that carries this D→N mutation [
46]. We also predict that gp120 molecules carrying the
positively selected I→F mutation at codon 423 will be
poorly recognized by the anti-CD4i Abs.
It is worth commenting on the GPGRAF motif (gp120
residues 312-317) that is sometimes (though increasingly
rarely) referred to as "highly conserved." Because QUASI
identifies non-consensus variants at two codons of this
motif as positively selected (Fig. 3), it may be
inappropriate to refer to the GPGRAF motif as "highly
conserved." Instead, five positively selected variants
appear to exist at this region in addition to GPGRAF:
GPGKAF, GPGRTF, GPGKTF, GPGRVF, and GPGKVF.
Kwong
et al. roughly divide the gp120
three-dimensional structure into outer (β9-β19 and
β22-β24) and inner (N-α1, β4-β8, and α5-C) domains joined
by a bridging sheet (β2, β3, β20, and β21) [ 44]. As
indicated by Kwong
et al. , all three domains include
variable regions; as QUASI shows, diversity in all three
regions is positively selected (Fig. 3). The selective
advantage we find rendered by diversity in some of these
regions (
e.g. , the "silent face") has been
attributed to neutral drift [ 44, 45], thus QUASI's
results run counter to previous interpretations. That is,
QUASI finds that diversity in regions proposed to be
inaccessible to the immune system nevertheless confers
selective advantage on HIV. QUASI's findings may be
consistent with the existence of gaps in the carbohydrate
groups thought to mask the silent face from gp120 from
immune surveillance. Interestingly, in
carbohydrate-building models of gp120, these gaps
correspond to codons where we find variation is
positively selected (P. Kwong, pers. commun.).
A "non-neutralizing" face has been identified where
binding Abs generally do not neutralize HIV when gp120 is
oligomerized [ 45, 47]; these data were interpreted as
indicating that the non-neutralizing face is occluded in
the trimer and that binding Abs are raised against shed
gp120 monomers. However, QUASI finds numerous positively
selected variants on the non-neutralizing face of the
inner domain. Assuming the "non-neutralizing" appellation
is appropriate, how do mutations on this face provide
selective advantage to the virus? The obvious answer is
that mutations provide escape not from direct B-cell
immunity but from other levels of immunity, such as
T-cell immunity [including major histocompatibility
complex (MHC) presentation] or indirect Ab immunity via
Ab-dependent cellular cytotoxicity (where gp120 molecules
found on infected cell surfaces are monomers).
To determine if HIV-1 viral sequences retain evidence
that T-cell immunity is a significant selective force
affecting HIV quasispeciation, we used QUASI to generate
a selection map of HIV-1 RT (Fig. 4). Because RT is not a
surface-expressed protein, it is not plausible that the
positively selected variants of RT have been selected by
direct B-cell immunity.
A priori , RT quasispeciation could
have been the result of neutral drift, but because QUASI
finds that replacement mutations confer selective
advantage on the virus, we reject the neutral drift
hypothesis at the 22 RT codons. If T-cell immunity
(including MHC presentation) is a selective pressure
shaping RT quasispeciation, positively selected variants
should be associated with T-cell epitopes. When known
T-cell epitopes [ 48] are plotted on the RT selection
map, the positively selected variants are found to
localize significantly (Fisher's exact test) both with
helper T-cell epitopes (p = 3.27 × 10 -2) and CTL
epitopes (p = 6.58 × 10 -3). We conclude that T-cell
immunity is a significant selection pressure shaping the
quasispeciation of RT and presumably is a significant
factor in the quasispeciation of other HIV proteins.
Thus, because positively selected gp120 variants found
throughout gp120 may be selected by T-cell immunity,
QUASI's finding that the non-neutralizing face includes
positively selected variants is not at odds with models
where the non-neutralizing face forms the gp120 trimer
interface. Nor are QUASI's results incompatible with the
silent face being silent to B-cell immunity. QUASI's
finding that positively selected variants may be buried
in the gp120 monomer is consistent with escape from
T-cell immunity.
In addition to the selection pressure exerted by
T-cell immunity, 3'-azido-3'-deoxythymidine (AZT) may
also have provided selection pressure for RT
quasispeciation in the sequences selection mapped by
QUASI [ 59, 51]. Indeed, QUASI identifies six of the
eight mutations known to be associated with AZT
resistance [ 52] as positively selected (N67, R70, W210,
Y215, and F215) or possibly positively selected (L41) to
HIV-1 (Fig. 4). The exceptions, two mutations of codon
219, are informative. Whereas mutations at other codons
are necessary for high resistance to AZT, mutations at
codon 219 are not, and codon 219 mutations arise late in
infection after earlier mutations have already rendered
RT resistant to AZT [ 53]. We conclude that the
additional AZT resistance conferred by codon 219
mutations did not provide significant selective advantage
to the profiled HIV viruses, possibly because HIV had
already acquired the maximum effective AZT resistance
selectable,
in vivo , when mutations arose at
this codon. Alternatively, the lysine at codon 219 may be
important for proper
in vivo RT function such that the
advantage conferred by increased AZT resistance does not
adequately compensate for impaired RT function. The RT
sequences we analyze were taken from patients who either
had no anti-RT treatment or were treated mainly with AZT
(though some patients who were treated with AZT were also
treated with 2',3'-dideoxyinosine) [ 49, 50, 51].
Therefore, we would predict that mutations associated
with resistance to other anti-RT drugs should not be
positively selected in the sequences QUASI analyzed. As
predicted, QUASI identifies none of the 50 RT mutation
associated with resistance to other drugs as positively
selected (compare Figure 4to the Los Alamos database [
52]).
Conclusion
We have developed an algorithm for using sequence data
to map the positively selected mutations of viral
quasispecies. We have used this method to map the
positively selected variants of influenza A HA, HIV-1 RT,
and HIV-1 gp120. Other obvious targets for selection
mapping are the hepatitis C and foot-and-mouth disease
viruses. We believe that potentially the most illuminating
use of selection mapping may be the comparison of viral
subpopulations to determine which variants are advantageous
under different selective pressures. For example, selection
mapping of HIV isolates with different cellular tropisms
will allow the determination of mutations that are
positively selected depending on the host cell type. Also,
we may use selection mapping to analyze HIV breakthrough
infections to determine if vaccines prevented the HIV
quasispecies from inhabiting normally advantageous regions
of the quasispecies sequence space. Finally, we propose
that the positively selected viral variants (as opposed to
all viral variants) should be included in future, highly
multivalent vaccines designed to compensate for
B-cell-selected antigenic drift.
Materials and Methods
QUASI--the selection mapping algorithm
An executable version of the QUASI software is
attached as an additional file (see additional file 1).
Also attached are a users' manual (user.txt - see
additional file 2) and a FASTA to QUASI file converter
PERL script (F2Q.pl - additional file 3). Current
versions of QUASI are available from the authors or may
be accessed at the Los Alamos Influenza Sequence Database
(http://www.flu.lanl.gov/).
For a set of viral nucleotide sequences, we determine
the variants that confer selective advantage by measuring
the empirical replacement to silent mutation ratio (R:S)
of each possible amino acid replacement and then
comparing this observed ratio to that which would be
expected if mutation were unselected. An R:S that is
found to be higher than expected indicates that the
replacement mutation tested is positively selected, while
a lower-than-expected observed R:S indicates that the
tested replacement mutation is negatively selected.
Testing for an overabundance of replacements across a
protein as a whole is a reasonable approach when only a
few nucleotide sequences are available, but because a
large number of mutated viral sequences are currently
available, such aggregation is unnecessarily crude.
Better are approaches that test for an overabundance of
replacement mutations at individual codons [ 23, 36, 37,
38]. However, these methods lump together replacement
mutations and thus allow negatively selected mutations to
conceal positively selected mutations and
vice versa (
e.g. , replacement mutations at a
codon may be negatively selected as a group despite the
fact that one or more particular replacement mutations
are positively selected).
To overcome these limitations, the QUASI algorithm
does not test the overall R:S of the entire protein as an
aggregate, nor does QUASI test the R:S of a codon to all
its replacement mutations taken as a whole. Rather, QUASI
tests the R:S of each particular replacement mutation at
each codon. That is, QUASI measures the R:S of the
mutations from a consensus codon towards each individual
replacement amino. For example, if the consensus codon at
a protein position were ttt (Phe), QUASI would test the
R:Ss of all point mutations from ttt. One of these
mutations is ttt→tat (Tyr). QUASI calculates the expected
R:S for ttt→tat under the null hypothesis of neutral
drift. The expected S is one because only one mutation
(ttc) is silent, and the expected R is also one because
only one point mutation of ttt (tat) codes for Tyr, so
the expected R:S is one, in this case. If QUASI rejects
the (Jukes-Cantor) neutral drift null hypothesis because
the observed R:S is significantly higher than one, then
QUASI classifies this replacement mutation (Tyr) as
positively selected. Conversely, if QUASI rejects the
null hypothesis because the observed R:S is significantly
lower than one, then QUASI determines that this
replacement mutation is negatively selected. QUASI
performs this procedure for all replacement point
mutations [
e.g. , in the example case, Tyr
(tat), Ile (att), Leu (tta, ttg, and ctt), Val (gtt), Ser
(tct), and Cys (tgt)].
In this paper, selection mapping is carried out
independent of the underlying phylogeny. QUASI uses R:S
to reject the null hypothesis that the mutational space
surrounding the consensus codon is distributed randomly
among all nine possible R or S point mutations (except
stop codons, which are considered to be disallowed). This
allows R:S calculations to be applied to viral sequences
whose ancestral sequence is unclear or unknown. This is a
both an advantage and a disadvantage over analyses that
rely on phylogeny. Phylogeny is difficult to determine
accurately and uniquely, and relying on phylogeny ignores
the persistence of positively selected replacement
mutations (the major effect of selection). On a practical
level, using phylogeny to reconstruct viruses' mutational
histories and then using intuited mutations leaves one
with insufficient data to determine positively-selected
codons [ 36] unless, as some have done, one assumes
observed drift is neutral and then tests for codons where
selection is more positive than average [ 23, 38]. The
significance problem can be compounded when one is
looking for independent occurrences of particular
mutations; often, there simply has not been enough
sequence evolution in HIV or influenza to map positively
selected variants if the retention of positively selected
mutations is ignored. The drawback of ignoring phylogeny
is a potentially high false positive rate (see
below).
Empirical R:S is compared to neutral R:S by means of a
two-sided test of the binomial distribution. For each
codon, we test the null hypothesis that all nine point
mutants are equally probable. The quotient
p = R/(R+S) is the probability of a
replacement mutation at this codon if each nucleotide is
equally mutable and each of the three mutational targets
at that codon are equally likely. The numerator, R, is
the number of point mutations that lead from the
consensus codon to the target amino acid. The chance of
observing
r replacement mutations is given by
the binomial distribution, , where
n is the number of codons providing
data for this position. To form a two-sided test, we sum
all terms
b (
kn ,,
p ) such that
b (
kn ,,
p ) is not greater than
b (
rn ,
p ,), where
k is in the set (0,...,
n ) and
r is the number of observed
replacement mutations. In other words, we sum the chances
of all events that are no more likely than that of the
observation. If this sum, α, is small (
e.g. , not greater than 0.05), we
reject the null hypothesis at the α level of
significance.
Working example
We analyze the following scenario as a working example
(Table 1).
Additional file 1
Click here for file
Additional file 2
Click here for file
Additional file 3
Click here for file
The consensus codon is given as ttt (Phe; Table 1,
column 1). Each observed mutation is also given (Table 1,
columns 1-3)
Because we know the frequency of silent mutations
(given as 10 in this example; Table 1, column 3), we also
know the expected R:S for each replacement mutation
(Table 1, column 4). That is, if selection is neutral for
any particular replacement mutation, we can calculate the
incidence of each replacement mutation we expect to
observe (by looking at a table of the genetic code).
Using the given frequency of each mutation observed,
we also know what the observed R:S is in each case (Table
1, column 5).
Now we use a two-tailed test of the binomial
distribution to determine if each observed R:S is
significantly different from the corresponding expected
R:S (Table 1, column 6). In some cases, the differences
are significant, in which case the appropriate
replacement mutation is assigned positive or negative
selection (positive if the observed R:S is larger than
expected and negative if the observed R:S is lower than
expected). Otherwise, the selection is assigned to be
neutral drift.
The QUASI algorithm thus indicates at this exemplary
codon that both tyrosine and serine are positively
selected; isoleucine, leucine, and cystine are negatively
selected; and the selective advantage or disadvantage of
valine is indistinguishable from neutral drift. Any other
ttt codon will have its own selective pressures assessed
in a similar but independent testing procedure.
Minimum sequences
For each possible replacement mutation, a minimum
number of mutations will need to be observed for a
selective event to be detected. This minimum number
differs depending on the consensus codon and the level of
significance. We have calculated the minimum number of
mutations needed to achieve the 5% significance level. At
the lower bound of this range, only 2 replacement
mutations will be required to detect positive selection
for any replacement mutation from cta or ctg. Any
replacement mutation from cta or ctg has an expected R:S
of 1:4, and thus an observed R:S of 2:0 will be
sufficient to reject neutral drift in favor of positive
selection. Conversely, detecting negative selection at a
cta or ctg codon is difficult. At the upper bound of the
range, a minimum of 17 observed mutations are required to
detect negative selection at such a codon (0:17 is
significantly lower than 1:4). For the most-typical
codon, the expected R:S is 1:3. For these modal codons,
at least 3 mutations must be observed to detect positive
selection (
i.e. , if observed R:S = 3:0). At
the same most-typical codon, 12 mutations are required,
at a minimum, to detect negative selection (
i.e. , if observed R:S = 0:12).
Because identification of positive selection is generally
the goal, the QUASI algorithm appears to have a practical
advantage over extant selection detectors, which require
either many more mutations or a biased expectation of
neutral drift in order to detect positive selection.
False-positive testing
False positives are likeliest when drift is completely
neutral (
e.g. , as was found by Suzuki and
Gojobori [ 36]). One may estimate the maximum frequency
of false positives by testing simulated sequences
generated under neutral drift parameters. We used the
EVOLVER program of the PAML package [ 54] to generate
sequences drifting under neutral Jukes-Cantor evolution.
For each simulation, we generated 300 related sequences
of length 999 and with average branch lengths varying in
0.1 length intervals from 0.1 to 1.0; each parameter set
was used to generate 10 sets of 300 sequences. False
positive percentages [Fig. 5; false positive percentage =
false positives / (false positives + neutral drift
variants)] were extremely low (∼ 2%) for relatively long
branch lengths (0.1-1.0). Extremely high false positive
rates (up to 70%) were found for extremely short branch
lengths (maximum at 0.001). As accurate branch lengths
are calculated using maximum likelihood phylogeny, these
branch lengths may be used to find the appropriate
false-positive percentage. For instance, if the branch
length were 0.01 [as appears appropriate for HA
(unpublished observation)], then the maximum false
positive rate would be estimated at 32%. We then use the
calculated neutral drift frequency (from QUASI) as an
estimator for the maximum false positives. If (as we
report) HA is found to have 5 neutral drift variants,
then we would expect a maximum of 2.3
false-positives.
Selection mapping
QUASI presents its results in the following
format:
1. The consensus amino acids are written in capital
letters.
2. Beneath each consensus amino acid are written in
capital letters all variants determined to be positively
selected (in descending order of frequency).
3. The negatively selected variants are not shown.
4. In lowercase letters and interspersed according to
their frequencies among the positively selected variants
are variants where the neutral drift null hypothesis
cannot be rejected with the given sequence data. As a
reasonable but arbitrary cut-off, we include apparently
unselected variants if they are among the 2 H+σmost
frequent variants, where
H is the Shannon information
content of the site and σ is the standard error of its
estimation [ 55]. is the
i th fraction of amino acids at the
site (the alignment gap is counted as a 21st amino acid).
For the Shannon calculation, alignment gaps are
considered distinct from "no data" gaps (artifacts of
indeterminate sequencing or sequence fragment overlaps;
such data absences are excluded from calculation).
Sequences
Nucleotide sequences were downloaded from GenBank at
the NIH. Sequences were included only if they were
isolated in the field and were not obviously pseudogenes
(sequences containing premature stop codons were removed
from consideration). We analyzed 310 sequences of
human-infective influenza A (H3 serotype), 6,151 HIV-1
gp120 sequences, and 400 HIV-1 RT sequences. All
sequences were pre-aligned with PILEUP [ 56] and/or
DIALIGN2 [ 57] then hand-corrected.
Abbreviations
HIV, human immunodeficiency virus; HA, hemagglutinin;
Ab, antibody; gp120, glycoprotein 120; RT, reverse
transcriptase; R:S, replacement to silent mutation ratio;
CD4BS, CD4 binding site epitope; CD4i, CD4-induced epitope;
MHC, major histocompatibility complex; AZT,
3'-azido-3'-deoxythymidine.