Background
Antigenic drift and the generation of viral quasispecies
Some RNA viruses form a quasispecies--a set of related viral variants that coexist in field populations and even within single infected individuals (reviewed in [1, 2, 3, 4, 5]). The emergence of immunologically distinct members of a viral quasispecies through mutation and subsequent immune selection is called "antigenic drift." Antigenic drift is thought to be important in human immunodeficiency virus (HIV) infection and the continuing seasonal influenza epidemics because immunity generated against one viral quasispecies member selects for escape variants. Attributed in part to antigenic drift are the moderately high failure rate and the short-lived efficacy of influenza vaccines [6], the failure of synthetic foot-and-mouth disease virus vaccines [7], and the inability of recombinant HIV vaccines to provide complete protection against field strains of the virus [8].
The hemagglutinin (HA) envelope surface glycoprotein--the major neutralizing determinant of influenza A--is a classic example of an antigenically drifting protein [9]. Walter Gerhard and colleagues demonstrated that the immune pressure exerted by monoclonal antibodies (Abs) selects for HA escape mutants in model systems [10, 11]. Later, Dimmock and colleagues showed that polyclonal anti-sera also select for escape mutants [12, 13]. Similarly, much of the observed variability of glycoprotein 120 (gp120), the principal surface antigen of HIV, is thought to reflect antigenic drift [14, 15, 16, 17]. The correlation of intra-patient viral diversity with immune response strength has been cited as evidence that the immune response is a selective factor in HIV antigenic drift [18, 19, 20, 21, 22].
Phylogenetic analyses describe divergence within a viral population, and these methods have been used to infer the selective advantages of viral variation [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. A more direct indication of the selective advantage gained through variation is an observed overabundance of replacement mutations relative to silent mutations in viral proteins [31]. Such analyses of gp120 and its V regions indicate that replacement mutations are generally over-represented in this protein and thus appear to confer selective advantage to HIV-1 [22, 24, 25, 26, 27, 28, 29, 30, 32, 33, 34, 35]. In more detailed analyses, several groups tested individual codons for replacement mutations that are, as an aggregate, overabundant [23, 36, 37, 38]. However, none of these methods determine which replacement mutations are actually positively selected. Also, when replacement mutations of varying fitness are lumped together, positively selected mutations may remain undetected among negatively selected mutations.
To overcome these limitations, we have developed a "selection mapping" algorithm. The cornerstone of selection mapping is the testing of each observed replacement mutation at each codon to identify those particular replacement mutations that are overabundant relative to silent mutations at that codon. Such replacement mutations are determined to be positively selected. Negatively selected variants are recognized as "noise" and are thereafter ignored. Here, we use the selection mapping method to identify the positively selected variants of influenza A HA (H3 serotype), HIV-1 reverse transcriptase (RT), and HIV-1 gp120.
Results and Discussion
Selection map of influenza A H3 hemagglutinin
QUASI identifies 25 HA codons where one or more replacement mutations are positively selected in the influenza A H3 virus (Fig. 1). [From our neutral drift testing of QUASI, we expect a maximum of about 2-3 false positives (see Materials and Methods).] The distribution of these positively selected codons is of particular interest. Without exception, the codons where variants are positively selected are on the HA surface (Fig. 2, left). The parsimonious explanation of this result is that HA variants are primarily selected for escape from B-cell immunity. If T-cell immunity is instead the primary selective force affecting HA variation, then either all T-cell immunity escape variants are coincidentally solvent-exposed, or HA T-cell epitopes are determined by Ab protection [39]. These 25 codons include 13 outside those identified as positively selected by Walter Fitch and colleagues [40]. We attribute our new findings primarily to sites where many variants are negatively selected but where at least one variant is positively selected. The Fitch group also identifies 6 additional codons as positively selected. We believe that these are false positives caused by the Fitch group's assumption that HA is, on average, neutrally drifting; these six codons may be less negatively selected than average, but they nevertheless appear to be negatively selected. Additionally, our data, while largely the same as those analyzed by the Fitch group, do have some differences.
Wiley et al. proposed four antigenic sites where field and laboratory mutations could be grouped on the HA surface [41]. These putative antigenic sites are indicated in Figure 1 and the right side of Figure 2. Positively selected variants are correlated (Fisher's exact test) with antigenic site A (p = 5.27 × 10⁻³), antigenic site B (p = 1.12 × 10⁻³), and antigenic site C (p = 1.0 × 10⁻⁵). In contrast, antigenic site D is not particularly correlated with positively selected variants. We believe the lack of positively selected variants spanning antigenic site D may explain a decades-old puzzle. In the fully assembled HA protein, site D is buried in the trimer interface and therefore is not generally accessible to Ab [41, 42]. At the time that the antigenic sites were proposed, residues in and around the trimer interface were variable and found on the monomer surface, so they were grouped and labeled as antigenic site D even though it was unclear how site D was recognized by Ab [41]. In fact, the only two positively selected variants QUASI identifies in site D are solvent-exposed at the extreme edge of the HA trimer (Fig. 2). Based on QUASI's selection map of HA, we now conclude that those mutations found in the trimer-buried portion of HA do not confer significant advantage on influenza A. That is, while the trimer-buried portion of antigenic site D includes the sound and fury of variability, it signifies nothing.
While QUASI finds that antigenic sites A-C are significantly associated with positively selected variation, this association may be a simple consequence of the sites' surface exposure. As can be seen in the left-hand side of Figure 2, positively selected variants are scattered across the entire exposed surface of HA. There are positively selected variants outside the antigenic sites, and there are subregions of the antigenic sites where variation is negatively selected. Thus, it may be more appropriate to view antigenic sites as more positional than functional. Indeed, more recent demarcations of antigenic sites enlarge antigenic sites A-D and add an additional antigenic site, site E, and these sites are now considered primarily positional rather than functional [6, 43].
Selection map of HIV-1 gp120
At 123 codons, QUASI's selection map of gp120 indicates that one or more non-consensus amino acids are positively selected in HIV-1 (Fig. 3). Most positively selected variants appear to be on the gp120 surface (not shown), but in contrast to HA, gp120 includes several monomer-buried positively selected variants (at sites I225, V270, N295, H333, I345, T387, I424, and L453). Additionally, some of the positively selected variants that appear to be solvent-exposed may normally be buried (some loops are absent from the core protein crystal structure [44]). The burial of positively selected variants in the gp120 monomer confirms that gp120 quasispeciation is not selected solely for escape from B-cell immunity.
Two competition-group epitopes have been identified for broadly neutralizing anti-gp120 Abs: the CD4-binding site (CD4BS) and the CD4-induced (CD4i) epitopes (references in [45]). Each epitope includes only a single non-consensus positively selected variant (Fig. 3). Thus, broadly neutralizing Abs appear to be those that engage few protein positions where variation is positively selected. Based on this observation, we propose that the neutralizing spectra of Abs may be predicted if the epitopes are known. For example, we would predict that the anti-CD4BS Abs will have particular trouble recognizing gp120 molecules carrying the positively selected variant of the CD4BS epitope (D→N at codon 474). This prediction is fulfilled in that the 15e anti-CD4BS monoclonal Ab fails to react to gp120 from HIV strain RF, a strain that carries this D→N mutation [46]. We also predict that gp120 molecules carrying the positively selected I→F mutation at codon 423 will be poorly recognized by the anti-CD4i Abs.
It is worth commenting on the GPGRAF motif (gp120 residues 312-317) that is sometimes (though increasingly rarely) referred to as "highly conserved." Because QUASI identifies non-consensus variants at two codons of this motif as positively selected (Fig. 3), it may be inappropriate to refer to the GPGRAF motif as "highly conserved." Instead, five positively selected variants appear to exist at this region in addition to GPGRAF: GPGKAF, GPGRTF, GPGKTF, GPGRVF, and GPGKVF.
Kwong et al. roughly divide the gp120 three-dimensional structure into outer (β9-β19 and β22-β24) and inner (N-α1, β4-β8, and α5-C) domains joined by a bridging sheet (β2, β3, β20, and β21) [44]. As indicated by Kwong et al., all three domains include variable regions; as QUASI shows, diversity in all three regions is positively selected (Fig. 3). The diversity in some of these regions (e.g., the "silent face") has previously been attributed to neutral drift [44, 45], so QUASI's finding that it confers selective advantage runs counter to previous interpretations. That is, QUASI finds that diversity in regions proposed to be inaccessible to the immune system nevertheless confers selective advantage on HIV. QUASI's findings may be consistent with the existence of gaps in the carbohydrate groups thought to mask the silent face of gp120 from immune surveillance. Interestingly, in carbohydrate-building models of gp120, these gaps correspond to codons where we find variation is positively selected (P. Kwong, pers. commun.).
A "non-neutralizing" face has been identified where binding Abs generally do not neutralize HIV when gp120 is oligomerized [45, 47]; these data were interpreted as indicating that the non-neutralizing face is occluded in the trimer and that binding Abs are raised against shed gp120 monomers. However, QUASI finds numerous positively selected variants on the non-neutralizing face of the inner domain. Assuming the "non-neutralizing" appellation is appropriate, how do mutations on this face provide selective advantage to the virus? The obvious answer is that mutations provide escape not from direct B-cell immunity but from other levels of immunity, such as T-cell immunity [including major histocompatibility complex (MHC) presentation] or indirect Ab immunity via Ab-dependent cellular cytotoxicity (where gp120 molecules found on infected cell surfaces are monomers).
To determine if HIV-1 viral sequences retain evidence that T-cell immunity is a significant selective force affecting HIV quasispeciation, we used QUASI to generate a selection map of HIV-1 RT (Fig. 4). Because RT is not a surface-expressed protein, it is not plausible that the positively selected variants of RT have been selected by direct B-cell immunity. A priori, RT quasispeciation could have been the result of neutral drift, but because QUASI finds that replacement mutations confer selective advantage on the virus, we reject the neutral drift hypothesis at the 22 RT codons. If T-cell immunity (including MHC presentation) is a selective pressure shaping RT quasispeciation, positively selected variants should be associated with T-cell epitopes. When known T-cell epitopes [48] are plotted on the RT selection map, the positively selected variants are found to localize significantly (Fisher's exact test) both with helper T-cell epitopes (p = 3.27 × 10⁻²) and CTL epitopes (p = 6.58 × 10⁻³). We conclude that T-cell immunity is a significant selection pressure shaping the quasispeciation of RT and presumably is a significant factor in the quasispeciation of other HIV proteins. Thus, because positively selected gp120 variants found throughout gp120 may be selected by T-cell immunity, QUASI's finding that the non-neutralizing face includes positively selected variants is not at odds with models where the non-neutralizing face forms the gp120 trimer interface. Nor are QUASI's results incompatible with the silent face being silent to B-cell immunity. QUASI's finding that positively selected variants may be buried in the gp120 monomer is consistent with escape from T-cell immunity.
In addition to the selection pressure exerted by T-cell immunity, 3'-azido-3'-deoxythymidine (AZT) may also have provided selection pressure for RT quasispeciation in the sequences selection mapped by QUASI [59, 51]. Indeed, QUASI identifies six of the eight mutations known to be associated with AZT resistance [52] as positively selected (N67, R70, W210, Y215, and F215) or possibly positively selected (L41) in HIV-1 (Fig. 4). The exceptions, two mutations of codon 219, are informative. Whereas mutations at other codons are necessary for high resistance to AZT, mutations at codon 219 are not, and codon 219 mutations arise late in infection after earlier mutations have already rendered RT resistant to AZT [53]. We conclude that the additional AZT resistance conferred by codon 219 mutations did not provide significant selective advantage to the profiled HIV viruses, possibly because HIV had already acquired the maximum effective AZT resistance selectable in vivo when mutations arose at this codon. Alternatively, the lysine at codon 219 may be important for proper in vivo RT function such that the advantage conferred by increased AZT resistance does not adequately compensate for impaired RT function. The RT sequences we analyze were taken from patients who either had no anti-RT treatment or were treated mainly with AZT (though some patients who were treated with AZT were also treated with 2',3'-dideoxyinosine) [49, 50, 51]. Therefore, we would predict that mutations associated with resistance to other anti-RT drugs should not be positively selected in the sequences QUASI analyzed. As predicted, QUASI identifies none of the 50 RT mutations associated with resistance to other drugs as positively selected (compare Figure 4 to the Los Alamos database [52]).
Conclusion
We have developed an algorithm for using sequence data to map the positively selected mutations of viral quasispecies. We have used this method to map the positively selected variants of influenza A HA, HIV-1 RT, and HIV-1 gp120. Other obvious targets for selection mapping are the hepatitis C and foot-and-mouth disease viruses. We believe that potentially the most illuminating use of selection mapping may be the comparison of viral subpopulations to determine which variants are advantageous under different selective pressures. For example, selection mapping of HIV isolates with different cellular tropisms will allow the determination of mutations that are positively selected depending on the host cell type. Also, we may use selection mapping to analyze HIV breakthrough infections to determine if vaccines prevented the HIV quasispecies from inhabiting normally advantageous regions of the quasispecies sequence space. Finally, we propose that the positively selected viral variants (as opposed to all viral variants) should be included in future, highly multivalent vaccines designed to compensate for B-cell-selected antigenic drift.
Materials and Methods
QUASI--the selection mapping algorithm
An executable version of the QUASI software is attached as an additional file (see additional file 1). Also attached are a users' manual (user.txt - see additional file 2) and a FASTA to QUASI file converter PERL script (F2Q.pl - additional file 3). Current versions of QUASI are available from the authors or may be accessed at the Los Alamos Influenza Sequence Database (http://www.flu.lanl.gov/).
For a set of viral nucleotide sequences, we determine the variants that confer selective advantage by measuring the empirical replacement to silent mutation ratio (R:S) of each possible amino acid replacement and then comparing this observed ratio to that which would be expected if mutation were unselected. An R:S that is found to be higher than expected indicates that the replacement mutation tested is positively selected, while a lower-than-expected observed R:S indicates that the tested replacement mutation is negatively selected. Testing for an overabundance of replacements across a protein as a whole is a reasonable approach when only a few nucleotide sequences are available, but because a large number of mutated viral sequences are currently available, such aggregation is unnecessarily crude. Better are approaches that test for an overabundance of replacement mutations at individual codons [23, 36, 37, 38]. However, these methods lump together replacement mutations and thus allow negatively selected mutations to conceal positively selected mutations and vice versa (e.g., replacement mutations at a codon may be negatively selected as a group despite the fact that one or more particular replacement mutations are positively selected).
To overcome these limitations, the QUASI algorithm does not test the overall R:S of the entire protein as an aggregate, nor does QUASI test the R:S of a codon to all its replacement mutations taken as a whole. Rather, QUASI tests the R:S of each particular replacement mutation at each codon. That is, QUASI measures the R:S of the mutations from a consensus codon towards each individual replacement amino acid. For example, if the consensus codon at a protein position were ttt (Phe), QUASI would test the R:Ss of all point mutations from ttt. One of these mutations is ttt→tat (Tyr). QUASI calculates the expected R:S for ttt→tat under the null hypothesis of neutral drift. The expected S is one because only one mutation (ttc) is silent, and the expected R is also one because only one point mutation of ttt (tat) codes for Tyr, so the expected R:S is one in this case. If QUASI rejects the (Jukes-Cantor) neutral drift null hypothesis because the observed R:S is significantly higher than one, then QUASI classifies this replacement mutation (Tyr) as positively selected. Conversely, if QUASI rejects the null hypothesis because the observed R:S is significantly lower than one, then QUASI determines that this replacement mutation is negatively selected. QUASI performs this procedure for all replacement point mutations [e.g., in the example case, Tyr (tat), Ile (att), Leu (tta, ttg, and ctt), Val (gtt), Ser (tct), and Cys (tgt)].
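The expected counts in the ttt example can be reproduced by enumerating the nine point mutants of a codon against the standard genetic code. The following is a minimal sketch, not the QUASI source: the names CODON_TABLE and expected_counts are hypothetical, and the sketch assumes the standard nuclear genetic code and skips stop codons, which the text treats as disallowed.

```python
from collections import Counter

# Standard genetic code, laid out in t, c, a, g order (assumption: standard table).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expected_counts(consensus):
    """Return (S, {amino acid: R}) over the point mutants of a consensus codon.

    S is the number of silent point mutants; R is, for each replacement
    amino acid, the number of point mutants that encode it.
    """
    consensus = consensus.lower()
    home = CODON_TABLE[consensus]
    silent = 0
    replacements = Counter()
    for pos in range(3):
        for base in BASES:
            if base == consensus[pos]:
                continue  # not a mutation
            mutant = consensus[:pos] + base + consensus[pos + 1:]
            aa = CODON_TABLE[mutant]
            if aa == "*":
                continue  # stop codons are treated as disallowed
            if aa == home:
                silent += 1
            else:
                replacements[aa] += 1
    return silent, dict(replacements)


if __name__ == "__main__":
    s, r = expected_counts("ttt")
    for aa, count in sorted(r.items()):
        print(f"ttt -> {aa}: expected R:S = {count}:{s}")
```

For ttt this yields one silent mutant and replacement counts of one each for Tyr, Ile, Val, Ser, and Cys and three for Leu, matching the list in the text.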
In this paper, selection mapping is carried out independent of the underlying phylogeny. QUASI uses R:S to reject the null hypothesis that the mutational space surrounding the consensus codon is distributed randomly among all nine possible R or S point mutations (except stop codons, which are considered to be disallowed). This allows R:S calculations to be applied to viral sequences whose ancestral sequence is unclear or unknown. This is both an advantage and a disadvantage over analyses that rely on phylogeny. Phylogeny is difficult to determine accurately and uniquely, and relying on phylogeny ignores the persistence of positively selected replacement mutations (the major effect of selection). On a practical level, using phylogeny to reconstruct viruses' mutational histories and then using intuited mutations leaves one with insufficient data to determine positively-selected codons [36] unless, as some have done, one assumes observed drift is neutral and then tests for codons where selection is more positive than average [23, 38]. The significance problem can be compounded when one is looking for independent occurrences of particular mutations; often, there simply has not been enough sequence evolution in HIV or influenza to map positively selected variants if the retention of positively selected mutations is ignored. The drawback of ignoring phylogeny is a potentially high false positive rate (see below).
Empirical R:S is compared to neutral R:S by means of a two-sided test of the binomial distribution. For each codon, we test the null hypothesis that all nine point mutants are equally probable. The quotient p = R/(R+S) is the probability of a replacement mutation at this codon if each nucleotide is equally mutable and each of the three mutational targets at that codon is equally likely. The numerator, R, is the number of point mutations that lead from the consensus codon to the target amino acid. The chance of observing r replacement mutations is given by the binomial distribution, $b(r; n, p) = \binom{n}{r} p^{r} (1-p)^{n-r}$, where n is the number of codons providing data for this position. To form a two-sided test, we sum all terms b(k; n, p) such that b(k; n, p) is not greater than b(r; n, p), where k is in the set (0,..., n) and r is the number of observed replacement mutations. In other words, we sum the chances of all events that are no more likely than that of the observation. If this sum, α, is small (e.g., not greater than 0.05), we reject the null hypothesis at the α level of significance.
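The two-sided test described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the QUASI implementation: the function names are hypothetical, n is taken to be r + s (the observed replacement plus silent counts for the mutation under test), and a small relative tolerance is added so that floating-point ties are counted as "no more likely" than the observation.

```python
from math import comb


def binom_pmf(k, n, p):
    """b(k; n, p): probability of k replacements among n observed mutations."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_alpha(r, s, R, S):
    """Sum the probabilities of all outcomes no more likely than the observation.

    r, s: observed replacement and silent counts (assumption: n = r + s).
    R, S: expected replacement and silent point-mutation counts, so p = R/(R+S).
    """
    n = r + s
    p = R / (R + S)
    cutoff = binom_pmf(r, n, p) * (1 + 1e-9)  # tolerance for floating-point ties
    return sum(
        binom_pmf(k, n, p) for k in range(n + 1) if binom_pmf(k, n, p) <= cutoff
    )


def classify(r, s, R, S, level=0.05):
    """Label a replacement mutation as positive, negative, or neutral."""
    if two_sided_alpha(r, s, R, S) > level:
        return "neutral"
    expected_r = (r + s) * R / (R + S)
    return "positive" if r > expected_r else "negative"


if __name__ == "__main__":
    # A replacement with expected R:S of 1:4 observed at 2:0.
    print(two_sided_alpha(2, 0, 1, 4), classify(2, 0, 1, 4))
```

With an expected R:S of 1:4, an observed 2:0 gives a sum of 0.04 and is classified as positively selected, while an observed 0:17 is classified as negatively selected.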
Working example
We analyze the following scenario as a working example (Table 1).
The consensus codon is given as ttt (Phe; Table 1, column 1). Each observed mutation is also given (Table 1, columns 1-3).
Because we know the frequency of silent mutations (given as 10 in this example; Table 1, column 3), we also know the expected R:S for each replacement mutation (Table 1, column 4). That is, if selection is neutral for any particular replacement mutation, we can calculate the incidence of each replacement mutation we expect to observe (by looking at a table of the genetic code).
Using the given frequency of each mutation observed, we also know what the observed R:S is in each case (Table 1, column 5).
Now we use a two-tailed test of the binomial distribution to determine if each observed R:S is significantly different from the corresponding expected R:S (Table 1, column 6). Where the difference is significant, the replacement mutation is assigned positive or negative selection (positive if the observed R:S is larger than expected and negative if the observed R:S is lower than expected). Otherwise, the mutation is assigned to neutral drift.
The QUASI algorithm thus indicates at this exemplary codon that both tyrosine and serine are positively selected; isoleucine, leucine, and cysteine are negatively selected; and the selective advantage or disadvantage of valine is indistinguishable from neutral drift. Any other ttt codon will have its own selective pressures assessed in a similar but independent testing procedure.
Minimum sequences
For each possible replacement mutation, a minimum number of mutations will need to be observed for a selective event to be detected. This minimum number differs depending on the consensus codon and the level of significance. We have calculated the minimum number of mutations needed to achieve the 5% significance level. At the lower bound of this range, only 2 replacement mutations will be required to detect positive selection for any replacement mutation from cta or ctg. Any replacement mutation from cta or ctg has an expected R:S of 1:4, and thus an observed R:S of 2:0 will be sufficient to reject neutral drift in favor of positive selection. Conversely, detecting negative selection at a cta or ctg codon is difficult. At the upper bound of the range, a minimum of 17 observed mutations are required to detect negative selection at such a codon (0:17 is significantly lower than 1:4). For the most-typical codon, the expected R:S is 1:3. For these modal codons, at least 3 mutations must be observed to detect positive selection (i.e., if observed R:S = 3:0). At the same most-typical codon, 12 mutations are required, at a minimum, to detect negative selection (i.e., if observed R:S = 0:12). Because identification of positive selection is generally the goal, the QUASI algorithm appears to have a practical advantage over extant selection detectors, which require either many more mutations or a biased expectation of neutral drift in order to detect positive selection.
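The minimum counts quoted above can be checked by searching for the smallest all-replacement (n:0) or all-silent (0:n) observation that the two-sided binomial test rejects at the 5% level. The sketch below is illustrative only: min_mutations and its parameters are hypothetical names, and it assumes the most extreme observation is the one that first reaches significance.

```python
from math import comb


def two_sided_alpha(r, n, p):
    """Sum of binomial terms no more likely than observing r of n replacements."""
    def pmf(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    cutoff = pmf(r) * (1 + 1e-9)  # tolerance for floating-point ties
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= cutoff)


def min_mutations(R, S, direction, level=0.05, limit=200):
    """Smallest n at which an n:0 (positive) or 0:n (negative) observation
    rejects neutral drift, given an expected R:S.

    Returns None if no n up to `limit` reaches significance.
    """
    p = R / (R + S)
    for n in range(1, limit + 1):
        r = n if direction == "positive" else 0
        if two_sided_alpha(r, n, p) <= level:
            return n
    return None


if __name__ == "__main__":
    for R, S in [(1, 4), (1, 3)]:
        print(
            f"expected R:S {R}:{S}:",
            "positive needs", min_mutations(R, S, "positive"),
            "negative needs", min_mutations(R, S, "negative"),
        )
```

Under these assumptions the search returns 2 and 17 for an expected R:S of 1:4, and 3 and 12 for an expected R:S of 1:3, in agreement with the figures in the text.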
False-positive testing
False positives are likeliest when drift is completely neutral (e.g., as was found by Suzuki and Gojobori [36]). One may estimate the maximum frequency of false positives by testing simulated sequences generated under neutral drift parameters. We used the EVOLVER program of the PAML package [54] to generate sequences drifting under neutral Jukes-Cantor evolution. For each simulation, we generated 300 related sequences of length 999 and with average branch lengths varying in 0.1 length intervals from 0.1 to 1.0; each parameter set was used to generate 10 sets of 300 sequences. False positive percentages [Fig. 5; false positive percentage = false positives / (false positives + neutral drift variants)] were extremely low (∼2%) for relatively long branch lengths (0.1-1.0). Extremely high false positive rates (up to 70%) were found for extremely short branch lengths (maximum at 0.001). As accurate branch lengths are calculated using maximum likelihood phylogeny, these branch lengths may be used to find the appropriate false-positive percentage. For instance, if the branch length were 0.01 [as appears appropriate for HA (unpublished observation)], then the maximum false positive rate would be estimated at 32%. We then use the calculated neutral drift frequency (from QUASI) as an estimator for the maximum false positives. If (as we report) HA is found to have 5 neutral drift variants, then we would expect a maximum of 2.3 false-positives.
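One way to read the arithmetic behind this estimate is to solve the false-positive percentage definition for the number of false positives. The symbols f (false-positive fraction), F (false positives), and N (neutral drift variants) are introduced here for illustration and do not appear in the original text; whether this is exactly the calculation the authors performed is an assumption.

```latex
\[
f = \frac{F}{F + N}
\quad\Longrightarrow\quad
F = \frac{f\,N}{1 - f}
  = \frac{0.32 \times 5}{1 - 0.32}
  \approx 2.35
\]
```

With f = 0.32 and N = 5, this gives a value in line with the maximum of roughly 2.3 false positives quoted above.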
Selection mapping
QUASI presents its results in the following format:
1. The consensus amino acids are written in capital letters.
2. Beneath each consensus amino acid are written in capital letters all variants determined to be positively selected (in descending order of frequency).
3. The negatively selected variants are not shown.
4. In lowercase letters and interspersed according to their frequencies among the positively selected variants are variants where the neutral drift null hypothesis cannot be rejected with the given sequence data. As a reasonable but arbitrary cut-off, we include apparently unselected variants if they are among the $2^{H+\sigma}$ most frequent variants, where $H = -\sum_i p_i \log_2 p_i$ is the Shannon information content of the site and σ is the standard error of its estimation [55]. Here $p_i$ is the i-th fraction of amino acids at the site (the alignment gap is counted as a 21st amino acid). For the Shannon calculation, alignment gaps are considered distinct from "no data" gaps (artifacts of indeterminate sequencing or sequence fragment overlaps; such data absences are excluded from calculation).
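The display cut-off in item 4 can be illustrated with a short calculation of the Shannon information content of one alignment column. This is a sketch under several assumptions: the function names are hypothetical, "-" is taken to mark an alignment gap and "?" a "no data" position, the logarithm is base 2, and σ is passed in by the caller because the text does not give the formula used to estimate it.

```python
from collections import Counter
from math import log2


def shannon_bits(column, no_data="?"):
    """Shannon information content (bits) of one alignment column.

    Alignment gaps ("-") count as a 21st symbol; "no data" positions
    (assumed here to be marked "?") are excluded from the calculation.
    """
    counts = Counter(x for x in column if x != no_data)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def display_cutoff(column, sigma=0.0):
    """Number of most-frequent variants eligible for display: 2**(H + sigma).

    sigma is the standard error of H; its estimator is not specified in the
    text, so it defaults to zero in this sketch.
    """
    return 2 ** (shannon_bits(column) + sigma)


if __name__ == "__main__":
    column = "FFFFYYLL-?"  # hypothetical column of residues across sequences
    print(shannon_bits(column), display_cutoff(column))
```

A fully conserved column gives H = 0 and a cut-off of one variant, while a column split evenly between two residues gives H = 1 and a cut-off of two.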
Sequences
Nucleotide sequences were downloaded from GenBank at the NIH. Sequences were included only if they were isolated in the field and were not obviously pseudogenes (sequences containing premature stop codons were removed from consideration). We analyzed 310 sequences of human-infective influenza A (H3 serotype), 6,151 HIV-1 gp120 sequences, and 400 HIV-1 RT sequences. All sequences were pre-aligned with PILEUP [56] and/or DIALIGN2 [57] then hand-corrected.
Abbreviations
HIV, human immunodeficiency virus; HA, hemagglutinin; Ab, antibody; gp120, glycoprotein 120; RT, reverse transcriptase; R:S, replacement to silent mutation ratio; CD4BS, CD4 binding site epitope; CD4i, CD4-induced epitope; MHC, major histocompatibility complex; AZT, 3'-azido-3'-deoxythymidine.