Introduction
For many years, scientists believed that point mutations
in genes are the genetic switches for somatic and inherited
diseases such as cystic fibrosis, phenylketonuria and
cancer. For this to be the case, disease-associated amino
acid substitutions should occur in functionally important
regions of the protein products of genes. While it has been
shown in specific cases that disease-associated amino acid
substitutions affect protein function, until now few
studies have examined this across many genes. Here we
provide direct evidence that disease-associated point
mutations occur in functionally important regions of the
genome and are not distributed evenly across the coding
regions of genes. This work supports recent efforts to
collect disease-associated mutational data in databases and
suggests that many of the mutations represented in those
databases are the likely underlying molecular cause of
disease.

Recently there have been a number of commercial and
public projects aimed at collecting and understanding human
genomic variation [1]. The goal of these projects is to
provide an understanding of how genotype is associated with
disease, how it affects our response to drugs and how it
affects the protein products of genes. Examples of these
projects include the SNP Consortium, the Human Gene
Mutation Database [2], many gene-specific databases
([3, 4], for example), and both public and private genome
sequencing efforts [5]. Much of the data being collected
consists of mutations annotated with their observed
phenotype. Automated annotation methods based on structural
and evolutionary parameters can lead to insight into the
molecular basis of disease.

With more than 4,000,000 identified variations and over
20,000 of them annotated with a phenotype, the great
majority of known variations remain uncharacterized.
Algorithms are needed for automatically annotating these
gene variations to gain insight into how they affect a
gene's regulation and/or the function of its protein
products. Using many collection technologies,
uncharacterized SNP data are being placed in public
databases such as the Human Gene Mutation Database (over
20,000 entries) [2] and the National Cancer Institute's
CGAP-GAI (Cancer Genome Anatomy Project Genetic Annotation
Initiative) [6]. The CGAP-GAI group has identified
10,243 SNPs by examining publicly available EST (Expressed
Sequence Tag) chromatograms.

Software for analyzing unannotated SNPs in known
disease-associated genes will be especially useful when
previously unobserved mutations are discovered. Every human
genome differs from the standard genome approximately once
every thousand base pairs [7, 8, 9]. Given knowledge of how
a genotype differs from the standard, it is important to be
able to predict which of the variations are likely to be
the cause of disease or other phenotypic differences.
Evolutionary information about regulatory and coding
regions of genes can be used to highlight certain mutations
or groups of mutations that are likely to be responsible
for a phenotype [10, 11, 12].

Early tools using phylogenetic and structural
information have shown promise in predicting the functional
consequences of a mutation [13]. These reports predict
that between 20% and 36% of non-synonymous SNPs alter
the function of a gene's protein product. In the report by
Chasman and Adams, evolutionary information was predicted
to be a useful component in determining whether a mutation
is deleterious [13, 14]. Disease-causing mutations are
also likely to be structurally perturbing at the protein
level [15]. Ng and Henikoff have introduced SIFT, a method
for predicting functional SNPs from a database of
unannotated polymorphisms [16, 17].

The relationship between disease-associated mutation
positions and evolutionary conservation has been reported
in specific cases. An analysis of the breast and ovarian
cancer susceptibility gene, BRCA1, showed that
disease-associated mutations tend to occur in highly
conserved regions [18]. An analysis of homologous
sequences in the androgen receptor has shown similar
results [10]. Keratin 12, KRT12, is associated with
Meesmann Corneal Epithelial Dystrophy (MCD). Reported
mutations often occur in the highly conserved
alpha-helix-initiation motif of rod domain 1A or in the
alpha-helix-termination motif of rod domain 2B [19].
Structure-based analysis methods have also been used to
analyze Osteogenesis imperfecta-associated COL1A1 mutations
and disease-associated P53 mutations (Mooney and Klein,
unpublished) [20]. Miller and Kumar have reported that
disease-associated mutations occur at conserved positions
in seven model genes [21].

To determine the degree to which mutation positions
differ evolutionarily from other positions, we have built
alignments of homologous genes for 231 disease-associated
genes. These multiple alignments have then been used to
assess the difference in evolutionary conservation between
positions that are disease-associated and positions that
are not. The results show that, in general, positions
with disease-associated mutations are more conserved than
the average position in the alignment. This suggests that
mutations at the most conserved positions are likely to be
the causative agents of disease, and our data set
identifies these mutations.

Results and Discussion
Our method compares the negative entropy of
disease-associated columns within an alignment to that of
the other columns in that alignment. The goal of this work
is to build these alignments, map the mutations to them,
and show that disease-associated positions are, in general,
conserved. The analysis was performed on the resulting
alignments and the results are shown in Table 1.

Mutation data were collected for the 231 genes used in
the analysis. They were chosen because they had a reported
cDNA sequence, disease-associated mutations and homologs in
SWISSPROT. These genes are listed in Table 1. Each
alignment consists of all the homologs in SWISSPROT as
determined by a BLAST search with an e-value threshold of
1e-15. For each alignment the negative entropy of each
column was calculated.
123
The conservation ratio parameter is defined as the
124
average negative entropy of analyzable positions with
125
reported mutations divided by the average negative entropy
126
of every analyzable position in the gene sequence. Analysis
127
was performed on 231 genes and 6185 mutations and of those
128
we found that 84.0% had conservation ratios less than one.
129
From those, 139 genes had more than ten analyzable
130
mutations and, of those, 88.0% had conservation ratios less
131
than one.
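
To make the direction of the ratio concrete, the following
worked example uses hypothetical averages (they are not
values from Table 1). It assumes that the negative entropy
of a column is the sum of P log P over the residues observed
in it, so that it is zero for a perfectly conserved column
and increasingly negative for more variable ones.

```latex
\[
\mathrm{CE}
  = \frac{\overline{\mathrm{NE}}_{\text{mutation positions}}}
         {\overline{\mathrm{NE}}_{\text{all positions}}}
  = \frac{-0.30}{-0.60}
  = 0.5 < 1
\]
```

Under that assumption both averages are non-positive, so a
ratio below one means that the mutated positions lie closer
to zero, that is, they are more conserved than the average
analyzable position.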
132
Use of evolutionary information is a promising approach
133
to automated characterization of mutations. These results
134
show that although conservation alone is not a perfect
135
predictive measure, there is useful information contained
136
in sequence alignments containing homologous genes.
137
Approaches using conservation in a multiple alignment
138
should work better when associated with other methods such
139
as structural analysis, population analysis and
140
experimental data. Knowledge of how the sequence pool
141
clusters into families may increase the sensitivity of the
142
method.
143
Our measured parameter, the conservation ratio, is a
144
quantity that measures the usefulness of a multiple
145
alignment for characterizing mutations in a gene sequence.
146
Knowledge of more mutations in a gene does not necessarily
147
lower the conservation ratio. We expect that knowledge of
148
more mutations in a gene will increase the statistical
149
significance of the conservation ratio. This is the likely
150
underlying cause of the result showing that genes with 10
151
or mutations increases are more likely to have a
152
conservation ratio less than one.

The alignments and BLAST searches are integrated on the
website, http://cancer.stanford.edu/mut-paper/.

Conclusions
In conclusion, there are estimated to be 30,000
non-synonymous differences between an individual and the
draft genome [23, 5, 7, 8, 9]. Determining which
positions are likely to be disease-associated is a
challenging and important problem. The finding that
disease-associated mutations occur in positions of
functional importance supports recent efforts to build
methods that predict which positions are likely to be
disease-associated [14, 13, 16, 17]. These methods
are likely to incorporate protein structure, the amino acid
identity of the mutation and phylogenetic information. In
an interesting twist, this observation also suggests that
these data may be usable as a functional genomics tool for
understanding the function of the protein products of genes
on a molecular level. Such a method would use the inherent
functional information contained in a phenotypically
annotated polymorphism to infer functional importance
within a gene.

Methods
Non-synonymous mutations were acquired from the Human
Gene Mutation Database
(http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html). A total of
231 genes with known disease association were chosen, each
having SWISSPROT homologs, a cDNA sequence and mutations in
the coding region. For each of those genes, all known
non-synonymous mutations were then downloaded along with
the cDNA sequence for that gene.

Each cDNA sequence was then translated and placed in a
FASTA-formatted file. For each of the resultant files a
BLAST [24] search was performed against the SWISSPROT
database. All sequences from the returned hits were then
stored in FASTA-formatted files. For each of the genes that
returned BLAST results with e-values smaller than
1e-15, ClustalW [25] was used to build a sequence
alignment.
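
The translation and FASTA-formatting step can be sketched
as follows. This is a minimal illustration, not the code
used for the analysis: it assumes the standard genetic code
and a reading frame that starts at the first base of the
cDNA, and the names translate, to_fasta and toy_gene are
hypothetical. The BLAST and ClustalW steps are external
programs and are not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the analysis used the standard code; the source does not say).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cdna):
    """Translate a coding sequence from its first base up to the first stop."""
    cdna = cdna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE.get(cdna[i:i + 3], "X")  # 'X' marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical toy sequence, for illustration only.
record = to_fasta("toy_gene", translate("ATGGCCAAATGA"))
```

The resulting record is the kind of single-sequence FASTA
file that would serve as the query for the BLAST search.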

For each amino acid position of interest, the negative
entropy (NE) was determined using the following formula
[26]:

NE = \sum_i P_i \log P_i

where the P_i are the probabilities of finding a
particular amino acid at that position. For this analysis,
gapped positions, "-", were treated as an independent amino
acid type.
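
A per-column calculation of this kind can be sketched as
below. The sketch assumes the negative entropy is the sum
of P log P over the symbols in a column, uses the natural
logarithm (the base is an assumption; it does not affect a
ratio of two such values), and counts "-" as its own symbol
as described above. The function names and the toy
alignment are hypothetical.

```python
import math
from collections import Counter


def negative_entropy(column):
    """Return sum_i P_i * ln(P_i) for one alignment column (always <= 0)."""
    counts = Counter(column)  # '-' is counted like any other symbol
    total = len(column)
    return sum((n / total) * math.log(n / total) for n in counts.values())


def column_negative_entropies(alignment):
    """alignment: list of equal-length aligned sequences (strings)."""
    return [negative_entropy(col) for col in zip(*alignment)]


# Hypothetical toy alignment, for illustration only.
toy = ["MKV-", "MKL-", "MRIA", "MKVA"]
ne = column_negative_entropies(toy)
# Column 0 is fully conserved, so its negative entropy is 0;
# the more variable columns have more negative values.
```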

For each known mutation, the negative entropy of the
column it occupies was tabulated. The average negative
entropy of the mutation positions within a gene was
compared to the average negative entropy of all columns
satisfying the criteria for analysis. Mutations outside of
the coding region and mutations encoding termination codons
were discarded.
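
Mapping a mutation onto its alignment column requires
skipping the gaps that the alignment inserts into the query
sequence. The sketch below shows one way to do this and to
apply the two discard rules. The mutation format, a tuple
of 1-based residue number and new residue with "*" standing
for a termination codon, is an assumption made for
illustration, as are the function names.

```python
def residue_to_column(aligned_query):
    """Map 1-based residue numbers of the query to 0-based alignment columns."""
    mapping = {}
    residue = 0
    for col, symbol in enumerate(aligned_query):
        if symbol != "-":  # gaps do not consume a residue number
            residue += 1
            mapping[residue] = col
    return mapping


def analyzable_columns(aligned_query, mutations):
    """Return one alignment column per retained mutation.

    mutations: iterable of (residue_number, new_residue) tuples.
    """
    mapping = residue_to_column(aligned_query)
    columns = []
    for position, new_residue in mutations:
        if new_residue == "*":  # mutation encodes a termination codon
            continue
        if position not in mapping:  # outside the translated coding region
            continue
        columns.append(mapping[position])
    return columns


# Hypothetical query row and mutation list, for illustration only.
cols = analyzable_columns("M-KV", [(1, "T"), (2, "*"), (3, "A"), (9, "G")])
```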

The list of genes was then sorted by the average negative
entropy of the mutations. We then calculated the
conservation ratio, CE, using:

CE = (average NE of mutation positions) /
     (average NE of all positions in the gene sequence)
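
The ratio itself can be sketched in a few lines. This is an
illustration under stated assumptions rather than the
original implementation: column_ne is taken to hold the
negative entropy of every analyzable column, and
mutation_columns holds one column index per mutation, so a
column hit by several mutations is counted several times.
Both names and the example values are hypothetical.

```python
def conservation_ratio(column_ne, mutation_columns):
    """CE = mean NE at mutated columns / mean NE over all analyzable columns."""
    if not mutation_columns:
        raise ValueError("no analyzable mutation positions")
    mean_all = sum(column_ne) / len(column_ne)
    if mean_all == 0:
        raise ValueError("alignment shows no variation; ratio is undefined")
    mean_mut = sum(column_ne[c] for c in mutation_columns) / len(mutation_columns)
    return mean_mut / mean_all


# Hypothetical per-column negative entropies, for illustration only.
column_ne = [0.0, -0.56, -1.04, -0.69]
ce_conserved = conservation_ratio(column_ne, [0, 1])  # mutations at conserved columns
ce_variable = conservation_ratio(column_ne, [2])      # mutation at a variable column
```

With these toy values the first ratio falls below one and
the second above one, matching the reading that a ratio
less than one indicates mutations at positions that are
more conserved than average.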