PROVEAN
After many years of supporting PROVEAN beyond grant funding the web-based version has been retired. While unsupported, it may still be downloaded.
PROVEAN (Protein Variation Effect Analyzer) is a software tool which predicts whether an amino acid substitution or indel has an impact on the biological function of a protein.
PROVEAN is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important.
The performance of PROVEAN is comparable to popular tools such as SIFT or PolyPhen-2.
A fast computation approach to obtain pairwise sequence alignment scores enabled the generation of precomputed PROVEAN predictions for 20 single AA substitutions and a single AA deletion at every amino acid position of all protein sequences in human and mouse.
Funding
This work is funded by the National Institutes of Health, grant number 5R01HG004701-04.
Related
Human genome
Other variant prediction tools
Related Research
Publications
Bioinformatics (Oxford, England). 2015-08-15; 31.16: 2745-7.
PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels
PloS one. 2012-01-01; 7.3: e46688.
Predicting the functional effect of amino acid substitutions and indels
Related
Human genome
Other variant prediction tools
Related Research
The results below are based on PROVEAN v1.1.
PROVEAN Score
PROVEAN was developed to predict whether a protein sequence variation affects protein function.
PROVEAN is able to provide predictions for any type of protein sequence variations including the following.
- Single or multiple amino acid substitutions
- Single or multiple amino acid insertions
- Single or multiple amino acid deletions
(a) Overview
An overview of the PROVEAN procedure is shown in Figure 1. Briefly, clustering of BLAST hits is performed by CD-HIT with a parameter of 75% global sequence identity. The top 30 clusters of closely related sequences form the supporting sequence set, which will be used to generate the prediction.
A delta alignment score is computed for each supporting sequence. The scores are then averaged within and across clusters to generate the final PROVEAN score. If the PROVEAN score is equal to or below a predefined threshold (e.g. -2.5), the protein variant is predicted to have a "deleterious" effect. If the PROVEAN score is above the threshold, the variant is predicted to have a "neutral" effect.
Figure 1. Computing the PROVEAN score.
(b) Setting the PROVEAN score threshold
PROVEAN introduces a delta alignment score based on the reference and variant versions of a protein query sequence with respect to sequence homologs collected from the NCBI NR protein database through BLAST. The PROVEAN score distribution of a set of 58K UniProt human protein variants with known functional outcome is shown in Figure 2.
For maximum separation of the deleterious and neutral variants for all 4 classes of human protein variants, the default score threshold is currently set at -2.5 for binary classification (i.e. deleterious vs neutral).
Figure 2. PROVEAN (v1.1) score distribution for 58K deleterious and neutral UniProt human protein variations.
(c) How to increase stringency of detection
To increase the specificity of detection (i.e. find deleterious variants with higher confidence), a lower score threshold can also be used (e.g. -4.1). To increase the sensitivity of detection (i.e. find more deleterious variants by including those with lower confidence), a higher score threshold can also be used (e.g. -1.3). A summary of specificity and sensitivity obtained using different score cutoffs is shown in Figure 3.
Figure 3. Sensitivity and specificity at different PROVEAN score cutoffs.
(d) PROVEAN v1.1 compared to the previous version (v1.0)
In order to refine the supporting sequence set, some conditions were added and parameters were changed accordingly in the process of selecting supporting sequences.
- In order to be used as the supporting sequence, a sequence must be at least 30% of the length of query sequence.
- While clustering, each sequence in a cluster must be at least 30% of the length of its representative sequence.
- With 75% global sequence identity, top 30 clusters were selected.
PROVEAN Performance
The overall accuracy for binary classification of protein variants (deleterious or neutral, classification property is being deleterious):
- 79.5% for UniProt human protein variations (download)
- 77.9% for UniProt non-human protein variations (other mammals, plants, bacteria, etc.) (download)
Table 1. Number of human protein variations and prediction accuracy of PROVEAN v1.1.
Variant types | Functional outcome | Peformance(%) | |||||
---|---|---|---|---|---|---|---|
Deleterious | Neutral | Total | Sensitivity | Specificity | Accuracy* | Balanced accuracy** | |
Single amino acid substitutions | 20821 | 36825 | 57646 | 78.85 | 79.55 | 79.29 | 79.20 |
Deletions | 652 | 77 | 729 | 96.78 | 66.23 | 93.55 | 81.51 |
Insertions | 110 | 61 | 171 | 91.82 | 78.69 | 87.13 | 85.25 |
Replacements | 79 | 59 | 138 | 93.67 | 64.41 | 81.16 | 79.04 |
Total | 21662 | 37022 | 58684 | 79.51 | 79.49 | 79.50 | 79.50 |
Table 2. Number of non-human protein variations and prediction accuracy of PROVEAN v1.1.
Variant types | Functional outcome | Performance (%) | |||||
---|---|---|---|---|---|---|---|
Deleterious | Neutral | Total | Sensitivity | Specificity | Accuracy* | Balanced accuracy** | |
Single amino acid substitutions | 14117 | 16498 | 30615 | 80.33 | 76.00 | 78.00 | 78.17 |
Deletions | 142 | 227 | 369 | 82.39 | 61.23 | 69.38 | 71.81 |
Insertions | 34 | 137 | 171 | 76.47 | 75.18 | 75.44 | 75.83 |
Replacements | 886 | 1029 | 1915 | 86.79 | 63.46 | 74.26 | 75.13 |
Total | 15179 | 17891 | 33070 | 80.72 | 75.09 | 77.67 | 77.90 |
Note:
- *Accuracy=(TP + TN)/(TP + TN + FP + FN); affected by potential unequal sizes of deleterious and neutral datasets
- **Balanced Acurracy=(Sn + Sp)/2; unaffected by potential unequal sizes of deleterious and neutral datasets
- A score threshold of -2.583 was used.
TP: true positive; TN: true negative; FP: false positive; FN: false negative; Sn: sensitivity; Sp: specificity
Comparison with Other Tools for Single Amino Acid Substitutions
(a) Prediction accuracy of different tools
The performance of PROVEAN was compared with other prediction tools: Mutation Assessor, SIFT, PolyPhen-2, and Condel, using default score thresholds suggested by individual tools. Overall, the performance of PROVEAN is comparable to other tools as shown in Tables 3 and 4.
Table 3. Prediction accuracy of PROVEAN, Mutation Assessor, SIFT, PolyPhen-2, and Condel for the UniProt human protein variant datasets.
Prediction tools | Score thresholds | Human dataset | References | ||||
---|---|---|---|---|---|---|---|
Sensitivity | Specificity | Accuracy* | Balanced accuracy** | No prediction | |||
PROVEAN | -4.100 (default) -2.500 -1.300 |
56.17 79.76 90.28 |
90.34 78.63 61.65 |
78.00 79.04 71.99 |
73.25 79.19 75.96 |
0 | Choi et al., 2012; web |
Mutation Assessor | 0.800 1.90 |
96.54 85.29 |
40.59 71.02 |
60.90 76.20 |
68.57 78.15 |
317 (0.55%) |
Reva et al., 2011; web |
SIFT | 0.050 | 85.03 | 68.95 | 74.77 | 76.99 | 1147 (1.99%) |
Kumar et al., 2009; web |
PolyPhen-2 | 0.432 | 88.68 | 62.45 | 72.00 | 75.56 | 2279 (3.95%) |
Adzhubei et al., 2010; web |
Condel web server | 0.469 | 93.84 | 46.23 | 64.67 | 70.04 | 7194 (12.48%) |
González-Pérez and López-Bigas, 2011; web |
Table 4. Prediction accuracy of PROVEAN, Mutation Assessor, SIFT, and PolyPhen-2 for the UniProt non-human protein variant datasets.
Prediction tools | Score thresholds | Non-human dataset | ||||
---|---|---|---|---|---|---|
Sensitivity | Specificity | Accuracy* | Balanced accuracy** | No prediction | ||
PROVEAN | -4.100 (default) -2.500 -1.300 |
59.76 81.11 89.39 |
86.67 75.22 61.63 |
74.26 77.94 74.43 |
73.22 78.17 75.51 |
0 |
Mutation Assessor | 0.800 1.90 |
93.17 81.30 |
45.13 67.16 |
67.07 73.62 |
69.15 74.23 |
732 (2.39%) |
SIFT | 0.050 | 87.47 | 69.27 | 77.88 | 78.37 | 1542 (5.04%) |
PolyPhen-2 | 0.432 | 87.77 | 65.81 | 76.11 | 76.79 | 1499 (4.90%) |
Note:
- The UniProt human and non-human protein variant datasets were used.
- The comparison was performed only for single amino acid substitutions because other tools do not handle other types of variation such as amino acid insertion or deletion.
- The "No prediction" column shows the number of variants for which the corresponding tool fails to provide a prediction.
- PROVEAN v1.1, Mutation Assessor web server v1.0, SIFT v4.0.3, PolyPhen-2 webserver v2.1.0, and Condel web server v1.4 were used.
- *Accuracy=(TP + TN)/(TP + TN + FP + FN); affected by potential unequal sizes of deleterious and neutral datasets
- **Balanced Acurracy=(Sn + Sp)/2; unaffected by potential unequal sizes of deleterious and neutral datasets
TP: true positive; TN: true negative; FP: false positive; FN: false negative; Sn: sensitivity; Sp: specificity
(b) Prediction consistency among different tools
The prediction results for the UniProt human protein variant dataset were obtained from multiple tools including PROVEAN, SIFT, and PolyPhen-2, and are summarized in Venn diagrams shown in Figure 4. Overall, the prediction results for 78.5% (15,618/19,898) of disease-associated variants and 46.8% (16,244/34,701) common variants are in agreement and shared by all three tools. In addition, each tool produces correct predictions for distinct subsets of disease or neutral variants. Combining prediction results from multiple tools can increase the chance of identifying functional variants that had been missed by other tools.
Figure 4. Venn diagrams showing predictions from PROVEAN, SIFT, and PolyPhen-2 for the UniProt human protein variant dataset (score thresholds used: PROVEAN, -1.3; SIFT, 0.05; PolyPhen-2, 0.432).
Note:
The number and percentage of correctly predicted variants are shown next to each tool.
Related
Human genome
Other variant prediction tools
Related Research
PROVEAN software
- PROVEAN source code
- README for installation and usage instructions
- LICENSE
PROVEAN precomputed scores
- PROVEAN scores (v1.1) on all possible single AA substitutions and deletions in human proteins from Ensembl 66 (SIFT scores are also available for single AA substitutions)
- PROVEAN scores (v1.1) on Human dbSNP 137 coding variants
- PROVEAN scores (v1.0) on UniProt human polymorphisms and disease mutations dataset
(Release: 2011_09, 20821 disease variants and 36825 polymorphisms) - PROVEAN scores (v1.0) on indel dataset collected from UniProt/Swiss-Prot database (Release: 2011_09)
- PROVEAN scores (v1.0) on indel dataset collected from 1000 Genomes Project (Release: August 2010)
Prerequisite software and database
- NCBI BLAST (download)
- CD-HIT (download, we recommend not using v4.6 or v4.6.1 (known issue))
- NCBI nr protein database (download latest release or download August 2011 release used in our publication)