
MutationTaster



models and training


MutationTaster2 still uses three different models: one for alterations that do not cause any amino acid exchange (without_aae), one for simple amino acid substitutions (simple_aae), and one for changes with a more complex effect on the amino acid sequence of the resulting peptide, such as a frameshift or a shifted start ATG (complex_aae).
See the models we used (numbers of training cases and frequencies).

training sets

The following table shows the composition of the data sets used to train the classifier. We used all available alterations that fulfilled our criteria: status 'disease mutation' in HGMD Pro, or a polymorphism confirmed by at least 4 carriers of the genotype in the 1000 Genomes Project (TGP); variants that appeared in both groups were discarded. For the web version, the classifier was trained with all alterations suitable for the given model; alterations of the less frequent type were fed into the training several times to reach equal frequencies of disease mutations and polymorphisms. See below for the composition of the cross-validation data sets.
model | n (polymorphisms) | n (disease mutations) | comments
without_aae | 6,807,269 | 122,238 | each disease mutation was used 55 times in the training of the web version (56 times in the cross-validation)
simple_aae | 20,967 | 151,542 | each polymorphism was used 7 times in the training of the web version and in the cross-validation
complex_aae | 2,340 | 123,213 | each polymorphism was used 52 times in the training of the web version (57 times in the cross-validation)
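The balancing described in the comments column can be sketched numerically. A minimal illustration (our assumption here: the repeat factor is essentially the rounded ratio of the two class sizes):

```python
def oversample_factor(n_majority, n_minority):
    """Number of times each minority-class case is fed into the training
    so that disease mutations and polymorphisms become equally frequent."""
    return round(n_majority / n_minority)

# class sizes taken from the training-set table above
print(oversample_factor(6807269, 122238))  # without_aae: ~56 repeats per disease mutation
print(oversample_factor(151542, 20967))    # simple_aae: 7 repeats per polymorphism
print(oversample_factor(123213, 2340))     # complex_aae: ~53 repeats per polymorphism
```

The slightly different factors quoted for the web version and the cross-validation (e.g. 52 vs. 57 for complex_aae) arise because the cross-validation training sets lack the held-out test cases, which changes the class ratio.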


cross-validation

We cross-validated MutationTaster2 five times for each of the models. In each cross-validation run, all but 4000 alterations suitable for the model were used to train the classifier; the disease potential of the remaining 4000 alterations (2000 disease mutations and 2000 polymorphisms) was then predicted by the classifier. As the number of known polymorphisms leading to complex changes (complex_aae model) is still very low, we could only use 400 alterations of this class as the test set, which explains the relatively high standard deviation.
The additional features used for the automatic classification of variants (such as presence in the TGP data in a reasonable number of individuals) were of course disabled, so the predictive performance on real data will be even better than the numbers shown here.
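The hold-out scheme described above can be sketched as follows. This is only an illustration with toy data: random sampling stands in for the actual split, and the classifier itself is left as a placeholder.

```python
import random

def cross_validate(disease, poly, runs=5, n_test=2000, seed=1):
    """Hold out n_test cases per class for testing, train on the rest,
    and repeat; training/prediction steps are placeholders."""
    rng = random.Random(seed)
    for _ in range(runs):
        test_dm, test_pm = rng.sample(disease, n_test), rng.sample(poly, n_test)
        held_dm, held_pm = set(test_dm), set(test_pm)
        train_dm = [d for d in disease if d not in held_dm]
        train_pm = [p for p in poly if p not in held_pm]
        # ... train the classifier on train_dm/train_pm,
        # then predict test_dm + test_pm and record the metrics ...
        yield len(train_dm) + len(train_pm), len(test_dm) + len(test_pm)

# toy class sizes instead of the real training sets
dm, pm = list(range(150000)), list(range(150000, 170000))
for n_train, n_test in cross_validate(dm, pm):
    print(n_train, n_test)  # 166000 4000, five times
```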

results of the cross-validation

model | n | accuracy | accuracy (disease mutations) | accuracy (polymorphisms) | PPV | NPV | specificity | sensitivity
simple_aae | 4000 | 0.886 ± 0.004 | 0.895 ± 0.005 | 0.877 ± 0.008 | 0.879 ± 0.007 | 0.893 ± 0.004 | 0.877 ± 0.008 | 0.895 ± 0.005
without_aae | 4000 | 0.922 ± 0.004 | 0.888 ± 0.006 | 0.957 ± 0.004 | 0.954 ± 0.004 | 0.895 ± 0.005 | 0.957 ± 0.003 | 0.888 ± 0.006
complex_aae | 400 | 0.907 ± 0.017 | 0.944 ± 0.004 | 0.869 ± 0.032 | 0.879 ± 0.026 | 0.939 ± 0.005 | 0.869 ± 0.032 | 0.944 ± 0.004
(As each test set contains equal numbers of disease mutations and polymorphisms, sensitivity equals the accuracy for disease mutations and specificity equals the accuracy for polymorphisms.)
The standard deviation (obtained from the 5 runs) is given for each value. See the results of each single run.


cross-comparison with other tools

We compared MutationTaster2 with PolyPhen-2 (HumVar and HumDiv models), SIFT, PROVEAN, and MutationTaster1. As SIFT and PolyPhen-2 can only handle single amino acid substitutions, we restricted the test set to suitable mutations. For this test, we used the web version of MutationTaster2 (which was trained with all available polymorphisms and disease mutations). The additional features used for the automatic classification of variants (such as presence in the TGP data in a reasonable number of individuals) were of course disabled.
We generated two test sets, each containing 3600 alterations leading to single amino acid substitutions with a known pathogenicity status:
1. 1800 known disease mutations from HGMD Pro (disease state = DM [disease mutation]) plus 1800 harmless polymorphisms from the 1000 Genomes Project (each of the 3 possible genotypes found in at least 50 samples)*
2. 1800 known disease mutations from ClinVar (disease state = pathogenic) plus 1800 harmless polymorphisms from the 1000 Genomes Project (each of the 3 possible genotypes found in at least 50 samples) [set 2 was not used with MutationTaster1]*
* We used the same polymorphisms in both sets.
Please note that MutationTaster1 was added later at the request of our reviewers and is so far included only in the HGMD-based comparison. MutationTaster1 uses Ensembl 59, and we could not always find the same transcripts that were used in the initial comparison; we therefore had to reduce the number of test cases in the HGMD-based comparison.

We submitted these alterations to the web interfaces of PolyPhen-2 (HumVar and HumDiv), SIFT, PROVEAN, MutationTaster2, and MutationTaster1 by entering the DNA change via its chromosomal position. Since these variants (on the DNA level) may cause different substitutions in different transcripts, we extracted the one result corresponding to the amino acid exchange from the test set; if there were several results for the amino acid exchange in question, we used the first. Since some variants could not be analysed by all programs (or did not return the required amino acid substitution), we randomly selected 1300 (ClinVar) / 1100 (HGMD) disease mutations and 1300 (ClinVar) / 1100 (HGMD) polymorphisms out of the 2814 (ClinVar) or 2381 (HGMD) variants for which all programs gave predictions. We then compared the predictions for these 2600 / 2200 test cases.

results of the cross-comparison

programme | total | TP | TN | FP | FN | NPV | PPV | sensitivity | specificity | accuracy

1000 Genomes and HGMD Pro
PPH2-var | 2200 | 868 | 976 | 124 | 232 | 80.8% | 87.5% | 78.9% | 88.7% | 83.8%
PPH2-div | 2200 | 944 | 903 | 197 | 156 | 85.3% | 82.7% | 85.8% | 82.1% | 84.0%
PROVEAN | 2200 | 856 | 966 | 134 | 244 | 79.8% | 86.5% | 77.8% | 87.8% | 82.8%
SIFT | 2200 | 910 | 944 | 156 | 190 | 83.2% | 85.4% | 82.7% | 85.8% | 84.3%
MT1 | 2200 | 931 | 961 | 139 | 169 | 85.0% | 87.0% | 84.6% | 87.4% | 86.0%
MutationTaster2 | 2200 | 976 | 961 | 139 | 124 | 88.6% | 87.5% | 88.7% | 87.4% | 88.0%

1000 Genomes and ClinVar
PPH2-var | 2600 | 1108 | 1159 | 141 | 192 | 85.8% | 88.7% | 85.2% | 89.2% | 87.2%
PPH2-div | 2600 | 1175 | 1076 | 224 | 125 | 89.6% | 84.0% | 90.4% | 82.8% | 86.6%
PROVEAN | 2600 | 1096 | 1146 | 154 | 204 | 84.9% | 87.7% | 84.3% | 88.2% | 86.2%
SIFT | 2600 | 1136 | 1123 | 177 | 164 | 87.3% | 86.5% | 87.4% | 86.4% | 86.9%
MutationTaster2 | 2600 | 1213 | 1132 | 168 | 87 | 92.9% | 87.8% | 93.3% | 87.1% | 90.2%
TP: true positive; TN: true negative; FP: false positive; FN: false negative; NPV = negative predictive value = TN / (TN + FN); PPV = positive predictive value = TP / (TP + FP); sensitivity = TP / (TP + FN); specificity = TN / (TN + FP); accuracy = (TP + TN) / (TP + TN + FP + FN)
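These formulas can be checked directly against the table; a small sketch, using the counts from the MutationTaster2 row of the HGMD-based comparison:

```python
def metrics(tp, tn, fp, fn):
    """Derive the performance measures used above from a confusion matrix."""
    return {
        "NPV":         tn / (tn + fn),
        "PPV":         tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }

# MutationTaster2, 1000 Genomes and HGMD Pro: TP=976, TN=961, FP=139, FN=124
for name, value in metrics(976, 961, 139, 124).items():
    print(f"{name}: {value:.1%}")
# NPV: 88.6%, PPV: 87.5%, sensitivity: 88.7%, specificity: 87.4%, accuracy: 88.0%
```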

MutationTaster2 displays automatic predictions for known harmless polymorphisms from the 1000 Genomes Project and known disease mutations from NCBI ClinVar. For the comparison of MutationTaster2 with the other tools, we did not consider the automatically displayed prediction (which is per se correct - or at least the same as our prior) but the actual prediction made by the classifier, which is reflected by the probability value. (An automatic prediction with a probability below 0.5 means that MutationTaster2 would predict the opposite if it did not consider the known outcome of this variant.) The performance of MutationTaster2 on real-life data, where known polymorphisms and disease mutations are recognised, is hence even better than the accuracy shown here. Please see our example using a real exome to evaluate the use of MutationTaster!

These results suggest a bias in ClinVar (as of February 2013) towards mutations with a more obvious effect on the protein, because all programs perform better on the ClinVar data set. See the ClinVar set and the single results.
For copyright reasons, we are unable to reproduce the list of disease mutations obtained from HGMD Professional as text, but we can offer the predictions for the disease mutations and polymorphisms as images. We also provide the results as a comprehensive text file, but here the identities of the HGMD disease mutations are replaced by sequential numbers.
We also provide detailed statistics about the consensus among the different tools for the HGMD data set.

ROC plot

ROC curve for the HGMD-based comparison

This plot shows the receiver operating characteristic for the comparison using the HGMD/TGP data set. Please note that such plots are intended to set a threshold that discriminates signal from noise - or, in the case of score-based predictors, to find an optimal cut-off value between disease mutations and polymorphisms.
MutationTaster2 does not return a 'score' but only a boolean prediction (disease causing or not) plus a confidence value for this prediction. This kind of plot is hence not very useful to determine the performance of MutationTaster2 (it would indeed be very useful to determine cut-off values for continuous scores, e.g. for predictions based only on PhyloP or PhastCons).
We know, however, that many articles use ROC curves to compare tools such as MutationTaster2, and hence include these curves to show that MutationTaster2, like its predecessor, has a higher area under the curve (AUC).

This plot was generated using R: we loaded the prediction (or confidence) scores of the different programs and the disease status of each alteration into the ROCR package. For PROVEAN and SIFT, where a decreasing score indicates a higher disease potential, we multiplied the scores by -1.
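For readers without R, the quantity behind such a curve can be illustrated in a few lines. This is a generic sketch with toy scores (not our actual data), including the sign flip applied to SIFT/PROVEAN-style scores:

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    disease mutation (label 1) scores higher than a randomly chosen
    polymorphism (label 0); ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical SIFT/PROVEAN-like scores, where lower = more deleterious:
# negate them first so that higher always means "more damaging"
sift_like = [0.01, 0.40, 0.03, 0.80]
labels    = [1,    0,    1,    0]   # 1 = disease mutation, 0 = polymorphism
print(auc([-s for s in sift_like], labels))  # 1.0 - perfect separation
```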

application to a real exome

To evaluate the performance of MutationTaster2, especially the false positive rate (FPR), we sent all exonic variants found in a 1000 Genomes sample to MutationTaster2, PolyPhen-2, SIFT, and PROVEAN. In the first step, the variants were called from the BAM alignment file of sample HG00377 using samtools/bcftools:
samtools mpileup -D -gf /Ensembl69/Homo_sapiens.GRCh37.69.dna.all.fa \
    HG00377.mapped.ILLUMINA.bwa.FIN.exome.20121211.bam \
  | bcftools view -c -g -v - > Exome_HG00377.vcf

A list of all variants within exons (± 10 flanking bases) was then obtained with a Perl script. This list was sent to MutationTaster2's Query Engine and to the web services of PolyPhen-2 and SIFT/PROVEAN.
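The extraction step can be sketched as follows (in Python rather than Perl; the exon coordinates and VCF lines are simplified toy data, not the real annotation):

```python
FLANK = 10  # keep variants up to 10 bases outside an exon

def in_exons(pos, exons, flank=FLANK):
    """True if a variant position falls inside any exon +/- flank bases."""
    return any(start - flank <= pos <= end + flank for start, end in exons)

def filter_vcf(vcf_lines, exons_by_chrom):
    """Yield VCF data lines whose position lies within an exon +/- flank."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        chrom, pos = line.split("\t")[:2]
        if in_exons(int(pos), exons_by_chrom.get(chrom, [])):
            yield line

# toy data: one exon on chromosome 1 spanning positions 1000-1100
exons = {"1": [(1000, 1100)]}
vcf = ["##fileformat=VCFv4.1",
       "1\t995\t.\tA\tG\t...",    # within 10 bases of the exon -> kept
       "1\t2000\t.\tC\tT\t..."]   # far outside any exon -> dropped
print(list(filter_vcf(vcf, exons)))
```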
The results obtained from the different tools were written to a database table; in case of more than one prediction for a variant (due to multiple transcripts), the most deleterious score was used. From this database table, we extracted all predictions for homozygous variants with a coverage of 10 or higher. Each table contains two parts: on top we list all predictions, below only those cases that were predicted by all tools.
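Collapsing several per-transcript predictions into one per variant can be sketched like this (assuming a score where higher means more deleterious; all names and values are illustrative):

```python
from collections import defaultdict

def most_deleterious(predictions):
    """Collapse (variant, transcript, score) rows to one score per variant,
    keeping the most deleterious (here: highest) score."""
    best = defaultdict(lambda: float("-inf"))
    for variant, _transcript, score in predictions:
        best[variant] = max(best[variant], score)
    return dict(best)

rows = [
    ("1:1007_A>G", "ENST00000001", 0.31),
    ("1:1007_A>G", "ENST00000002", 0.97),  # same variant, another transcript
    ("1:2050_C>T", "ENST00000003", 0.12),
]
print(most_deleterious(rows))  # {'1:1007_A>G': 0.97, '1:2050_C>T': 0.12}
```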
Because SIFT and PolyPhen-2 can only predict the outcome of single amino acid substitutions, we created another table that contains only these cases. To this end, we extracted the predicted amino acid substitutions from MT2 and SIFT and included only those cases that both tools predicted to cause such an exchange. All predictions that assumed pathogenicity (including 'possibly damaging' and 'probably damaging') were counted as false positives.
MutationTaster2 not only returns more predictions than the other tools (because it can also handle synonymous substitutions and the flanking bases outside the exons) but also produces fewer false positives than its competitors. Please note that in this real-life example, MutationTaster2's automatic classification routines were used.

Non-synonymous, synonymous, and non-coding variants

all predictions
variants analysed by all tools

only variants leading to single amino acid substitutions

all predictions
variants analysed by all tools
FP: false positives (i.e. pathogenic predictions); TN: true negatives (benign predictions); FPR: false positive rate = FP / (FP + TN)


MutationTaster's results can be interactively inspected: http://doro.charite.de/temp/vcf_7890_1559988265/progress.html.
Note that MutationTaster2 gives predictions for all transcripts, which inflates the number of results. Please also note that many of the disease predictions are frameshifts, which are neglected by the other tools; some are even known disease mutations.

statistics for homozygous / all variants with a coverage of at least 10
the sample exome (HG00377)
PolyPhen-2 results (Settings: HumDiv model, all transcripts)