Background
        
          Antigenic drift and the generation of viral
          quasispecies
          Some RNA viruses form a quasispecies--a set of related
          viral variants that coexist in field populations and even
          within single infected individuals (reviewed in [ 1, 2,
          3, 4, 5]). The emergence of immunologically distinct
          members of a viral quasispecies through mutation and
          subsequent immune selection is called "antigenic drift."
          Antigenic drift is thought to be important in human
          immunodeficiency virus (HIV) infection and the continuing
          seasonal influenza epidemics because immunity generated
          against one viral quasispecies member selects for escape
          variants. Attributed in part to antigenic drift are the
          moderately high failure rate and the short-lived efficacy
          of influenza vaccines [ 6], the failure of synthetic
          foot-and-mouth disease virus vaccines [ 7], and the
          inability of recombinant HIV vaccines to provide complete
          protection against field strains of the virus [ 8].
          The hemagglutinin (HA) envelope surface
          glycoprotein--the major neutralizing determinant of
          influenza A--is a classic example of an antigenically
          drifting protein [ 9]. Walter Gerhard and colleagues
          demonstrated that the immune pressure exerted by
          monoclonal antibodies (Abs) selects for HA escape mutants
          in model systems [ 10, 11]. Later, Dimmock and colleagues
          showed that polyclonal anti-sera also select for escape
          mutants [ 12, 13]. Similarly, much of the observed
          variability of glycoprotein 120 (gp120), the principal
          surface antigen of HIV, is thought to reflect antigenic
          drift [ 14, 15, 16, 17]. The correlation of intra-patient
          viral diversity with immune response strength has been
          cited as evidence that the immune response is a selective
          factor in HIV antigenic drift [ 18, 19, 20, 21, 22].
          Phylogenetic analyses describe divergence within a
          viral population, and these methods have been used to
          infer the selective advantages of viral variation [ 18,
          19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. A more
          direct indication of the selective advantage gained
          through variation is an observed overabundance of
          replacement mutations relative to silent mutations in
          viral proteins [ 31]. Such analyses of gp120 and its V
          regions indicate that replacement mutations are generally
          over-represented in this protein and thus appear to
          confer selective advantage to HIV-1 [ 22, 24, 25, 26, 27,
          28, 29, 30, 32, 33, 34, 35]. In more detailed analyses,
          several groups tested individual codons for replacement
          mutations that are, as an aggregate, overabundant [ 23,
          36, 37, 38]. However, none of these methods determine
          which replacement mutations are actually positively
          selected. Also, when replacement mutations of varying
          fitness are lumped together, positively selected
          mutations may remain undetected among negatively selected
          mutations.
          To overcome these limitations, we have developed a
          "selection mapping" algorithm. The cornerstone of
          selection mapping is the testing of each observed
          replacement mutation at each codon to identify those
          particular replacement mutations that are overabundant
          relative to silent mutations at that codon. Such
          replacement mutations are determined to be positively
          selected. Negatively selected variants are recognized as
          "noise" and are thereafter ignored. Here, we use the
          selection mapping method to identify the positively
          selected variants of influenza A HA (H3 serotype), HIV-1
          reverse transcriptase (RT), and HIV-1 gp120.
        
      
      
        Results and Discussion
        
          Selection map of influenza A H3
          hemagglutinin
          QUASI identifies 25 HA codons where one or more
          replacement mutations are positively selected in the
          influenza A H3 virus (Fig. 1). [From our neutral drift
          testing of QUASI, we expect a maximum of about 2-3 false
          positives (see Materials and Methods)]. The distribution
          of these positively selected codons is of particular
          interest. Without exception, the codons where variants
          are positively selected are on the HA surface (Fig. 2,
          left). The parsimonious explanation of this result is
          that HA variants are primarily selected for escape from
          B-cell immunity. If T-cell immunity is instead the
          primary selective force affecting HA variation, then
          either all T-cell immunity escape variants are
          coincidentally solvent-exposed, or HA T-cell epitopes are
          determined by Ab protection [ 39]. These 25 codons
          include 13 outside those identified as positively
          selected by Walter Fitch and colleagues [ 40]. We
          attribute our new findings primarily to sites where many
          variants are negatively selected but where at least one
          variant is positively selected. The Fitch group also
          identifies 6 additional codons as positively selected. We
          believe that these are false positives caused by the
          Fitch group's assumption that HA is, on average,
          neutrally drifting; these six codons may be less
          negatively selected than average, but they nevertheless
          appear to be negatively selected. Additionally, our data,
          while largely the same as those analyzed by the Fitch
          group, do have some differences.
          Wiley 
          et al . proposed four antigenic
          sites where field and laboratory mutations could be
          grouped on the HA surface [ 41]. These putative antigenic
          sites are indicated in Figure 1and the right side of
          Figure 2. Positively selected variants are correlated
          (Fisher's exact test) with antigenic site A (p = 5.27 ×
          10 -3), antigenic site B (p = 1.12 × 10 -3), and
          antigenic site C (p = 1.0 × 10 -5). In contrast,
          antigenic site D is not particularly correlated with
          positively selected variants. We believe the lack of
          positively selected variants spanning antigenic site D
          may explain a decades-old puzzle. In the fully assembled
          HA protein, site D is buried in the trimer interface and
          therefore is not generally accessible to Ab [ 41, 42]. At
          the time that the antigenic sites were proposed, residues
          in and around the trimer interface were variable and
          found on the monomer surface, so they were grouped and
          labeled as antigenic site D even though it was unclear
          how site D was recognized by Ab [ 41]. In fact, the only
          two positively selected variants QUASI identifies in site
          D are solvent-exposed at the extreme edge of the HA
          trimer (Fig. 2). Based on QUASI's selection map of HA, we
          now conclude that those mutations found in the
          trimer-buried portion of HA do not confer significant
          advantage on influenza A. That is, while the
          trimer-buried portion of antigenic site D includes the
          sound and fury of variability, it signifies nothing.
          While QUASI finds that antigenic sites A-C are
          significantly associated with positively selected
          variation, this association may be a simple consequence
          of the sites' surface exposure. As can be seen in the
          left-hand side of Figure 2, positively selected variants
          are scattered across the entire exposed surface of HA.
          There are positively selected variants outside the
          antigenic sites, and there are subregions of the
          antigenic sites where variation is negatively selected.
          Thus, it may be more appropriate to view antigenic sites
          as more positional than functional. Indeed, more recent
          demarcations of antigenic sites enlarge antigenic sites
          A-D and add an additional antigenic site, site E, and
          these sites are now considered primarily positional
          rather than functional [ 6, 43].
        
        
          Selection map of HIV-1 gp120
          At 123 codons, QUASI's selection map of gp120
          indicates that one or more non-consensus amino acids are
          positively selected in HIV-1 (Fig. 3). Most positively
          selected variants appear to be on the gp120 surface (not
          shown), but in contrast to HA, gp120 includes several
          monomer-buried positively selected variants (at sites
          I225, V270, N295, H333, I345, T387, I424, and L453).
          Additionally, some of the positively selected variants
          that appear to be solvent-exposed may normally be buried
          (some loops are absent from the core protein crystal
          structure [ 44]). The burial of positively selected
          variants in the gp120 monomer confirms that gp120
          quasispeciation is not selected solely for escape from
          B-cell immunity.
          Two competition-group epitopes have been identified
          for broadly neutralizing anti-gp120 Abs: the CD4-binding
          site (CD4BS) and the CD4-induced (CD4i) epitopes
          (references in [ 45]). Each epitope includes only a
          single non-consensus positively selected variant (Fig.
          3). Thus, broadly neutralizing Abs appear to be those
          that engage few protein positions where variation is
          positively selected. Based on this observation, we
          propose that the neutralizing spectra of Abs may be
          predicted if the epitopes are known. For example, we
          would predict that the anti-CD4BS Abs will have
          particular trouble recognizing gp120 molecules carrying
          the positively selected variant of the CD4BS epitope (D→N
          at codon 474). This prediction is fulfilled in that the
          15e anti-CD4BS monoclonal Ab fails to react to gp120 from
          HIV strain RF, a strain that carries this D→N mutation [
          46]. We also predict that gp120 molecules carrying the
          positively selected I→F mutation at codon 423 will be
          poorly recognized by the anti-CD4i Abs.
          It is worth commenting on the GPGRAF motif (gp120
          residues 312-317) that is sometimes (though increasingly
          rarely) referred to as "highly conserved." Because QUASI
          identifies non-consensus variants at two codons of this
          motif as positively selected (Fig. 3), it may be
          inappropriate to refer to the GPGRAF motif as "highly
          conserved." Instead, five positively selected variants
          appear to exist at this region in addition to GPGRAF:
          GPGKAF, GPGRTF, GPGKTF, GPGRVF, and GPGKVF.
          Kwong 
          et al. roughly divide the gp120
          three-dimensional structure into outer (β9-β19 and
          β22-β24) and inner (N-α1, β4-β8, and α5-C) domains joined
          by a bridging sheet (β2, β3, β20, and β21) [ 44]. As
          indicated by Kwong 
          et al. , all three domains include
          variable regions; as QUASI shows, diversity in all three
          regions is positively selected (Fig. 3). The selective
          advantage we find rendered by diversity in some of these
          regions ( 
          e.g. , the "silent face") has been
          attributed to neutral drift [ 44, 45], thus QUASI's
          results run counter to previous interpretations. That is,
          QUASI finds that diversity in regions proposed to be
          inaccessible to the immune system nevertheless confers
          selective advantage on HIV. QUASI's findings may be
          consistent with the existence of gaps in the carbohydrate
          groups thought to mask the silent face from gp120 from
          immune surveillance. Interestingly, in
          carbohydrate-building models of gp120, these gaps
          correspond to codons where we find variation is
          positively selected (P. Kwong, pers. commun.).
          A "non-neutralizing" face has been identified where
          binding Abs generally do not neutralize HIV when gp120 is
          oligomerized [ 45, 47]; these data were interpreted as
          indicating that the non-neutralizing face is occluded in
          the trimer and that binding Abs are raised against shed
          gp120 monomers. However, QUASI finds numerous positively
          selected variants on the non-neutralizing face of the
          inner domain. Assuming the "non-neutralizing" appellation
          is appropriate, how do mutations on this face provide
          selective advantage to the virus? The obvious answer is
          that mutations provide escape not from direct B-cell
          immunity but from other levels of immunity, such as
          T-cell immunity [including major histocompatibility
          complex (MHC) presentation] or indirect Ab immunity via
          Ab-dependent cellular cytotoxicity (where gp120 molecules
          found on infected cell surfaces are monomers).
          To determine if HIV-1 viral sequences retain evidence
          that T-cell immunity is a significant selective force
          affecting HIV quasispeciation, we used QUASI to generate
          a selection map of HIV-1 RT (Fig. 4). Because RT is not a
          surface-expressed protein, it is not plausible that the
          positively selected variants of RT have been selected by
          direct B-cell immunity. 
          A priori , RT quasispeciation could
          have been the result of neutral drift, but because QUASI
          finds that replacement mutations confer selective
          advantage on the virus, we reject the neutral drift
          hypothesis at the 22 RT codons. If T-cell immunity
          (including MHC presentation) is a selective pressure
          shaping RT quasispeciation, positively selected variants
          should be associated with T-cell epitopes. When known
          T-cell epitopes [ 48] are plotted on the RT selection
          map, the positively selected variants are found to
          localize significantly (Fisher's exact test) both with
          helper T-cell epitopes (p = 3.27 × 10 -2) and CTL
          epitopes (p = 6.58 × 10 -3). We conclude that T-cell
          immunity is a significant selection pressure shaping the
          quasispeciation of RT and presumably is a significant
          factor in the quasispeciation of other HIV proteins.
          Thus, because positively selected gp120 variants found
          throughout gp120 may be selected by T-cell immunity,
          QUASI's finding that the non-neutralizing face includes
          positively selected variants is not at odds with models
          where the non-neutralizing face forms the gp120 trimer
          interface. Nor are QUASI's results incompatible with the
          silent face being silent to B-cell immunity. QUASI's
          finding that positively selected variants may be buried
          in the gp120 monomer is consistent with escape from
          T-cell immunity.
          In addition to the selection pressure exerted by
          T-cell immunity, 3'-azido-3'-deoxythymidine (AZT) may
          also have provided selection pressure for RT
          quasispeciation in the sequences selection mapped by
          QUASI [ 59, 51]. Indeed, QUASI identifies six of the
          eight mutations known to be associated with AZT
          resistance [ 52] as positively selected (N67, R70, W210,
          Y215, and F215) or possibly positively selected (L41) to
          HIV-1 (Fig. 4). The exceptions, two mutations of codon
          219, are informative. Whereas mutations at other codons
          are necessary for high resistance to AZT, mutations at
          codon 219 are not, and codon 219 mutations arise late in
          infection after earlier mutations have already rendered
          RT resistant to AZT [ 53]. We conclude that the
          additional AZT resistance conferred by codon 219
          mutations did not provide significant selective advantage
          to the profiled HIV viruses, possibly because HIV had
          already acquired the maximum effective AZT resistance
          selectable, 
          in vivo , when mutations arose at
          this codon. Alternatively, the lysine at codon 219 may be
          important for proper 
          in vivo RT function such that the
          advantage conferred by increased AZT resistance does not
          adequately compensate for impaired RT function. The RT
          sequences we analyze were taken from patients who either
          had no anti-RT treatment or were treated mainly with AZT
          (though some patients who were treated with AZT were also
          treated with 2',3'-dideoxyinosine) [ 49, 50, 51].
          Therefore, we would predict that mutations associated
          with resistance to other anti-RT drugs should not be
          positively selected in the sequences QUASI analyzed. As
          predicted, QUASI identifies none of the 50 RT mutation
          associated with resistance to other drugs as positively
          selected (compare Figure 4to the Los Alamos database [
          52]).
        
      
      
        Conclusion
        We have developed an algorithm for using sequence data
        to map the positively selected mutations of viral
        quasispecies. We have used this method to map the
        positively selected variants of influenza A HA, HIV-1 RT,
        and HIV-1 gp120. Other obvious targets for selection
        mapping are the hepatitis C and foot-and-mouth disease
        viruses. We believe that potentially the most illuminating
        use of selection mapping may be the comparison of viral
        subpopulations to determine which variants are advantageous
        under different selective pressures. For example, selection
        mapping of HIV isolates with different cellular tropisms
        will allow the determination of mutations that are
        positively selected depending on the host cell type. Also,
        we may use selection mapping to analyze HIV breakthrough
        infections to determine if vaccines prevented the HIV
        quasispecies from inhabiting normally advantageous regions
        of the quasispecies sequence space. Finally, we propose
        that the positively selected viral variants (as opposed to
        all viral variants) should be included in future, highly
        multivalent vaccines designed to compensate for
        B-cell-selected antigenic drift.
      
      
        Materials and Methods
        
          QUASI--the selection mapping algorithm
          An executable version of the QUASI software is
          attached as an additional file (see additional file 1).
          Also attached are a users' manual (user.txt - see
          additional file 2) and a FASTA to QUASI file converter
          PERL script (F2Q.pl - additional file 3). Current
          versions of QUASI are available from the authors or may
          be accessed at the Los Alamos Influenza Sequence Database
          (http://www.flu.lanl.gov/).
          For a set of viral nucleotide sequences, we determine
          the variants that confer selective advantage by measuring
          the empirical replacement to silent mutation ratio (R:S)
          of each possible amino acid replacement and then
          comparing this observed ratio to that which would be
          expected if mutation were unselected. An R:S that is
          found to be higher than expected indicates that the
          replacement mutation tested is positively selected, while
          a lower-than-expected observed R:S indicates that the
          tested replacement mutation is negatively selected.
          Testing for an overabundance of replacements across a
          protein as a whole is a reasonable approach when only a
          few nucleotide sequences are available, but because a
          large number of mutated viral sequences are currently
          available, such aggregation is unnecessarily crude.
          Better are approaches that test for an overabundance of
          replacement mutations at individual codons [ 23, 36, 37,
          38]. However, these methods lump together replacement
          mutations and thus allow negatively selected mutations to
          conceal positively selected mutations and 
          vice versa ( 
          e.g. , replacement mutations at a
          codon may be negatively selected as a group despite the
          fact that one or more particular replacement mutations
          are positively selected).
          To overcome these limitations, the QUASI algorithm
          does not test the overall R:S of the entire protein as an
          aggregate, nor does QUASI test the R:S of a codon to all
          its replacement mutations taken as a whole. Rather, QUASI
          tests the R:S of each particular replacement mutation at
          each codon. That is, QUASI measures the R:S of the
          mutations from a consensus codon towards each individual
          replacement amino. For example, if the consensus codon at
          a protein position were ttt (Phe), QUASI would test the
          R:Ss of all point mutations from ttt. One of these
          mutations is ttt→tat (Tyr). QUASI calculates the expected
          R:S for ttt→tat under the null hypothesis of neutral
          drift. The expected S is one because only one mutation
          (ttc) is silent, and the expected R is also one because
          only one point mutation of ttt (tat) codes for Tyr, so
          the expected R:S is one, in this case. If QUASI rejects
          the (Jukes-Cantor) neutral drift null hypothesis because
          the observed R:S is significantly higher than one, then
          QUASI classifies this replacement mutation (Tyr) as
          positively selected. Conversely, if QUASI rejects the
          null hypothesis because the observed R:S is significantly
          lower than one, then QUASI determines that this
          replacement mutation is negatively selected. QUASI
          performs this procedure for all replacement point
          mutations [ 
          e.g. , in the example case, Tyr
          (tat), Ile (att), Leu (tta, ttg, and ctt), Val (gtt), Ser
          (tct), and Cys (tgt)].
          In this paper, selection mapping is carried out
          independent of the underlying phylogeny. QUASI uses R:S
          to reject the null hypothesis that the mutational space
          surrounding the consensus codon is distributed randomly
          among all nine possible R or S point mutations (except
          stop codons, which are considered to be disallowed). This
          allows R:S calculations to be applied to viral sequences
          whose ancestral sequence is unclear or unknown. This is a
          both an advantage and a disadvantage over analyses that
          rely on phylogeny. Phylogeny is difficult to determine
          accurately and uniquely, and relying on phylogeny ignores
          the persistence of positively selected replacement
          mutations (the major effect of selection). On a practical
          level, using phylogeny to reconstruct viruses' mutational
          histories and then using intuited mutations leaves one
          with insufficient data to determine positively-selected
          codons [ 36] unless, as some have done, one assumes
          observed drift is neutral and then tests for codons where
          selection is more positive than average [ 23, 38]. The
          significance problem can be compounded when one is
          looking for independent occurrences of particular
          mutations; often, there simply has not been enough
          sequence evolution in HIV or influenza to map positively
          selected variants if the retention of positively selected
          mutations is ignored. The drawback of ignoring phylogeny
          is a potentially high false positive rate (see
          below).
          Empirical R:S is compared to neutral R:S by means of a
          two-sided test of the binomial distribution. For each
          codon, we test the null hypothesis that all nine point
          mutants are equally probable. The quotient 
          p = R/(R+S) is the probability of a
          replacement mutation at this codon if each nucleotide is
          equally mutable and each of the three mutational targets
          at that codon are equally likely. The numerator, R, is
          the number of point mutations that lead from the
          consensus codon to the target amino acid. The chance of
          observing 
          r replacement mutations is given by
          the binomial distribution, , where 
          n is the number of codons providing
          data for this position. To form a two-sided test, we sum
          all terms 
          b ( 
          kn ,, 
          p ) such that 
          b ( 
          kn ,, 
          p ) is not greater than 
          b ( 
          rn , 
          p ,), where 
          k is in the set (0,..., 
          n ) and 
          r is the number of observed
          replacement mutations. In other words, we sum the chances
          of all events that are no more likely than that of the
          observation. If this sum, α, is small ( 
          e.g. , not greater than 0.05), we
          reject the null hypothesis at the α level of
          significance.
        
        
          Working example
          We analyze the following scenario as a working example
          (Table 1).
          Additional file 1
          Click here for file
          Additional file 2
          Click here for file
          Additional file 3
          Click here for file
          The consensus codon is given as ttt (Phe; Table 1,
          column 1). Each observed mutation is also given (Table 1,
          columns 1-3)
          Because we know the frequency of silent mutations
          (given as 10 in this example; Table 1, column 3), we also
          know the expected R:S for each replacement mutation
          (Table 1, column 4). That is, if selection is neutral for
          any particular replacement mutation, we can calculate the
          incidence of each replacement mutation we expect to
          observe (by looking at a table of the genetic code).
          Using the given frequency of each mutation observed,
          we also know what the observed R:S is in each case (Table
          1, column 5).
          Now we use a two-tailed test of the binomial
          distribution to determine if each observed R:S is
          significantly different from the corresponding expected
          R:S (Table 1, column 6). In some cases, the differences
          are significant, in which case the appropriate
          replacement mutation is assigned positive or negative
          selection (positive if the observed R:S is larger than
          expected and negative if the observed R:S is lower than
          expected). Otherwise, the selection is assigned to be
          neutral drift.
          The QUASI algorithm thus indicates at this exemplary
          codon that both tyrosine and serine are positively
          selected; isoleucine, leucine, and cystine are negatively
          selected; and the selective advantage or disadvantage of
          valine is indistinguishable from neutral drift. Any other
          ttt codon will have its own selective pressures assessed
          in a similar but independent testing procedure.
        
        
          Minimum sequences
          For each possible replacement mutation, a minimum
          number of mutations will need to be observed for a
          selective event to be detected. This minimum number
          differs depending on the consensus codon and the level of
          significance. We have calculated the minimum number of
          mutations needed to achieve the 5% significance level. At
          the lower bound of this range, only 2 replacement
          mutations will be required to detect positive selection
          for any replacement mutation from cta or ctg. Any
          replacement mutation from cta or ctg has an expected R:S
          of 1:4, and thus an observed R:S of 2:0 will be
          sufficient to reject neutral drift in favor of positive
          selection. Conversely, detecting negative selection at a
          cta or ctg codon is difficult. At the upper bound of the
          range, a minimum of 17 observed mutations are required to
          detect negative selection at such a codon (0:17 is
          significantly lower than 1:4). For the most-typical
          codon, the expected R:S is 1:3. For these modal codons,
          at least 3 mutations must be observed to detect positive
          selection ( 
          i.e. , if observed R:S = 3:0). At
          the same most-typical codon, 12 mutations are required,
          at a minimum, to detect negative selection ( 
          i.e. , if observed R:S = 0:12).
          Because identification of positive selection is generally
          the goal, the QUASI algorithm appears to have a practical
          advantage over extant selection detectors, which require
          either many more mutations or a biased expectation of
          neutral drift in order to detect positive selection.
        
        
          False-positive testing
          False positives are likeliest when drift is completely
          neutral ( 
          e.g. , as was found by Suzuki and
          Gojobori [ 36]). One may estimate the maximum frequency
          of false positives by testing simulated sequences
          generated under neutral drift parameters. We used the
          EVOLVER program of the PAML package [ 54] to generate
          sequences drifting under neutral Jukes-Cantor evolution.
          For each simulation, we generated 300 related sequences
          of length 999 and with average branch lengths varying in
          0.1 length intervals from 0.1 to 1.0; each parameter set
          was used to generate 10 sets of 300 sequences. False
          positive percentages [Fig. 5; false positive percentage =
          false positives / (false positives + neutral drift
          variants)] were extremely low (∼ 2%) for relatively long
          branch lengths (0.1-1.0). Extremely high false positive
          rates (up to 70%) were found for extremely short branch
          lengths (maximum at 0.001). As accurate branch lengths
          are calculated using maximum likelihood phylogeny, these
          branch lengths may be used to find the appropriate
          false-positive percentage. For instance, if the branch
          length were 0.01 [as appears appropriate for HA
          (unpublished observation)], then the maximum false
          positive rate would be estimated at 32%. We then use the
          calculated neutral drift frequency (from QUASI) as an
          estimator for the maximum false positives. If (as we
          report) HA is found to have 5 neutral drift variants,
          then we would expect a maximum of 2.3
          false-positives.
        
        
          Selection mapping
          QUASI presents its results in the following
          format:
          1. The consensus amino acids are written in capital
          letters.
          2. Beneath each consensus amino acid are written in
          capital letters all variants determined to be positively
          selected (in descending order of frequency).
          3. The negatively selected variants are not shown.
          4. In lowercase letters and interspersed according to
          their frequencies among the positively selected variants
          are variants where the neutral drift null hypothesis
          cannot be rejected with the given sequence data. As a
          reasonable but arbitrary cut-off, we include apparently
          unselected variants if they are among the 2 H+σmost
          frequent variants, where 
          H is the Shannon information
          content of the site and σ is the standard error of its
          estimation [ 55]. is the 
          i th fraction of amino acids at the
          site (the alignment gap is counted as a 21st amino acid).
          For the Shannon calculation, alignment gaps are
          considered distinct from "no data" gaps (artifacts of
          indeterminate sequencing or sequence fragment overlaps;
          such data absences are excluded from calculation).
        
        
          Sequences
          Nucleotide sequences were downloaded from GenBank at
          the NIH. Sequences were included only if they were
          isolated in the field and were not obviously pseudogenes
          (sequences containing premature stop codons were removed
          from consideration). We analyzed 310 sequences of
          human-infective influenza A (H3 serotype), 6,151 HIV-1
          gp120 sequences, and 400 HIV-1 RT sequences. All
          sequences were pre-aligned with PILEUP [ 56] and/or
          DIALIGN2 [ 57] then hand-corrected.
        
      
      
        Abbreviations
        HIV, human immunodeficiency virus; HA, hemagglutinin;
        Ab, antibody; gp120, glycoprotein 120; RT, reverse
        transcriptase; R:S, replacement to silent mutation ratio;
        CD4BS, CD4 binding site epitope; CD4i, CD4-induced epitope;
        MHC, major histocompatibility complex; AZT,
        3'-azido-3'-deoxythymidine.