Mutation T@ster

FAQs

Can I download pre-computed predictions?

Unfortunately not. Unlike SIFT or PolyPhen, which handle only single amino acid substitutions, MutationTaster works at the DNA level and also handles insertions and deletions. The exome alone comprises about 30 Mb, with 3 possible single base exchanges (SBEs) at each site (let alone introns and InDels). These 30 M x 3 SBEs may affect several different transcripts, leading to about 30,000,000 (bases) x 3 (SBEs) x 5 (transcripts) = 450,000,000 values to pre-compute.
We could of course generate such a list, but it would still not include the InDels and most of the introns. More importantly, such a list would take a very long time to generate and might soon become outdated. We would rather spend our efforts on improving MutationTaster!
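
The estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the constants are the round figures quoted in the answer, and the names are made up for illustration.

```python
# Rough estimate of the number of predictions to pre-compute,
# using the round figures quoted in the answer (not measured values).
EXOME_BASES = 30_000_000     # about 30 Mb of exome sequence
SBES_PER_SITE = 3            # each base can be exchanged for 3 other bases
TRANSCRIPTS_PER_SITE = 5     # rough number of transcripts affected per site


def precomputed_values(bases=EXOME_BASES, sbes=SBES_PER_SITE,
                       transcripts=TRANSCRIPTS_PER_SITE):
    """Return the number of single base exchange predictions to store."""
    return bases * sbes * transcripts


total = precomputed_values()
print(f"{total:,} values")
```

InDels and most intronic positions are not covered by this count, so the real number would be considerably larger.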

Why doesn't MutationTaster know my valid transcript ID?

We filter out protein-coding transcripts (Ensembl biotype protein_coding) that lack a correct start codon (ATG) or a correct stop codon (TGA, TAA, TAG). As a result, MutationTaster may complain about "no suitable transcripts" or "no transcripts for this gene found" although some are listed in Ensembl. We decided to exclude such transcripts from analysis in MutationTaster because their poor annotation might in the end lead to a wrong prediction. Transcripts of mitochondrial genes are not tested for integrity due to differences in the mitochondrial genetic code.
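
The integrity test described here can be pictured roughly as follows. This is a minimal sketch, not MutationTaster's actual code: the function name and the `mitochondrial` flag are hypothetical, and only the start and stop codon checks named in the answer are implemented.

```python
START_CODON = "ATG"
STOP_CODONS = {"TGA", "TAA", "TAG"}  # stop codons of the standard (nuclear) genetic code


def has_intact_cds(cds: str, mitochondrial: bool = False) -> bool:
    """Return True if a coding sequence starts with ATG and ends with a stop codon.

    Mitochondrial transcripts are passed through untested, because the
    mitochondrial genetic code differs from the standard one.
    """
    if mitochondrial:
        return True
    seq = cds.upper()
    if len(seq) < 6:  # too short to hold both a start and a stop codon
        return False
    return seq.startswith(START_CODON) and seq[-3:] in STOP_CODONS


print(has_intact_cds("ATGGCCTGA"))   # starts with ATG, ends with TGA
print(has_intact_cds("ATGGCCATG"))   # no stop codon at the end
```

A transcript for which such a check fails would not be offered for analysis, which is why a valid Ensembl ID can still be rejected.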

What do the prob values mean?

The prob value is the probability of the prediction, i.e. a value close to 1 indicates a high 'security' of the prediction. Please note that the prob value used here is NOT the p value (probability of error) as used in t-test statistics. Please read the next two questions!

What does a probability below 0.5 mean?

Probabilities below 0.5 occur if the automatic prediction for a variant differs from the classification the Bayes classifier would have made. If an alteration is a 'true' SNP (as confirmed by the existence of each of the three genotypes AA, AB, BB in the HapMap data or by presence in the 1000 Genomes Project (TGP) in homozygous state in more than 4 cases), it is automatically predicted to be a polymorphism. Alterations that are known disease mutations (as reflected by the 'pathogenic' flag in ClinVar) or which lead to a premature termination codon (and eventually to nonsense-mediated mRNA decay (NMD)) are automatically assigned the 'disease causing' status. In both cases, the Bayes classifier is run nevertheless and the probability for the prediction that was automatically made is shown (a probability <0.5 hence indicates that MutationTaster would have come to a different conclusion).
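
The interplay between automatic predictions and the classifier can be sketched as below. All names are hypothetical, and two points are assumptions rather than documented behaviour: that the classifier yields a single probability for 'disease causing', and that disease flags are checked before SNP evidence when both apply.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Variant:
    hapmap_all_three_genotypes: bool = False  # AA, AB and BB all seen in HapMap
    tgp_homozygous_count: int = 0             # homozygous carriers in the 1000 Genomes Project
    clinvar_pathogenic: bool = False          # 'pathogenic' flag in ClinVar
    causes_premature_stop: bool = False       # premature termination codon


def report(variant: Variant, p_disease: float) -> Tuple[str, float]:
    """Return (prediction, displayed probability) for one variant.

    p_disease is the classifier's probability for 'disease causing'.
    """
    p_polymorphism = 1.0 - p_disease
    # Automatic predictions: the probability shown is the classifier's
    # probability for the automatically assigned class, so it can be < 0.5.
    # Assumption: disease flags take precedence over SNP evidence.
    if variant.clinvar_pathogenic or variant.causes_premature_stop:
        return "disease causing", p_disease
    if variant.hapmap_all_three_genotypes or variant.tgp_homozygous_count > 4:
        return "polymorphism", p_polymorphism
    # No automatic rule applies: use the classifier's own decision.
    if p_disease >= 0.5:
        return "disease causing", p_disease
    return "polymorphism", p_polymorphism


print(report(Variant(hapmap_all_three_genotypes=True), 0.75))
```

In this example the classifier favours 'disease causing' with 0.75, but the HapMap evidence forces 'polymorphism', so the displayed probability is 0.25.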

Does a high probability indicate a high probability for a correct prediction, then?

Unfortunately not. Our results show that wrong predictions are usually not reflected by low probabilities but are rather caused by polymorphisms or disease causing alterations that show characteristics of the other case, e.g. SNPs that are highly conserved and destroy protein features or disease mutations that appear to have no effect on the protein/gene at all.

Why don't you exclude known SNPs as possible disease mutations?

Because for many variants listed in dbSNP, all three genotypes have never been observed in unaffected individuals. Some SNPs even appear to have only one allele. And even if both alleles were observed, there should be a sufficient number of healthy individuals who are homozygous for the minor allele to exclude a damaging effect.
If all three genotypes were observed in the HapMap project, or the alteration was found in homozygous state in the 1000 Genomes Project more than 4 times, it will automatically be regarded as a polymorphism.

Why don't you use HapMap frequencies to exclude SNPs as possible disease mutations?

We do this now, but we only exclude HapMap SNPs when all three genotypes were observed (see above). Using the allele frequencies alone might lead to the exclusion of SNPs causing rare diseases when homozygous for the minor allele.
Please note that for our cross-validation statistics, only the predictions made by the Bayes classifier - regardless of HapMap genotype counts - were used.

The prediction for my favourite alteration has changed. Why?

Well, this is a very rare event. However, as the available data such as protein features is increasing, we regularly update our database and re-train the classifier. In some cases, the annotation of a gene improves drastically. This may yield formerly unknown protein features in your gene/protein at your position which can of course influence the prediction of your alteration.

Is there any way to learn how a single alteration is classified?

Well, yes and no. A naive Bayes classifier studies the frequencies of single item statuses (such as 'conservation in cattle - highly conserved' or 'existence of a disulfide bond - no') in both groups of the training set ('polymorphisms' / 'disease mutations'). It compares the statuses of these items in your alteration with the known frequencies and then decides which group fits best.
You can of course study the model and hence the frequencies used by the classifier. They can be found in our supplementary data.
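
To make the principle concrete, here is a toy naive Bayes classifier in Python. It is a generic textbook sketch, not MutationTaster's implementation: the training examples are invented, and the add-one smoothing (which assumes two possible statuses per item) is our choice for the illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count how often each (item, status) pair occurs in each group."""
    group_sizes = Counter()
    status_counts = defaultdict(Counter)  # group -> Counter of (item, status)
    for group, features in examples:
        group_sizes[group] += 1
        for item, status in features.items():
            status_counts[group][(item, status)] += 1
    return group_sizes, status_counts


def classify(features, group_sizes, status_counts):
    """Return {group: probability} for one alteration."""
    total = sum(group_sizes.values())
    log_scores = {}
    for group, size in group_sizes.items():
        score = math.log(size / total)  # prior: share of this group in the training set
        for item, status in features.items():
            count = status_counts[group][(item, status)]
            # Add-one smoothing, assuming two possible statuses per item.
            score += math.log((count + 1) / (size + 2))
        log_scores[group] = score
    top = max(log_scores.values())
    weights = {g: math.exp(s - top) for g, s in log_scores.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


# Invented toy training set with two items per alteration.
training = [
    ("disease mutation", {"conservation": "high", "disulfide bond": "yes"}),
    ("disease mutation", {"conservation": "high", "disulfide bond": "no"}),
    ("polymorphism", {"conservation": "low", "disulfide bond": "no"}),
    ("polymorphism", {"conservation": "low", "disulfide bond": "no"}),
]
sizes, counts = train(training)
probs = classify({"conservation": "high", "disulfide bond": "no"}, sizes, counts)
print(max(probs, key=probs.get), probs)
```

Each item contributes independently to the result, which is why the frequencies in the model tell you how strongly a single status pushes an alteration towards one group, but not why a particular combination ended up where it did.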

What does "InDel alterations are limited to 12 bp" mean?

The website states that with MT "InDel alterations are limited to 12 bp". Does that refer to the ACTUAL insertion or deletion, or to the DESCRIPTION of the insertion or deletion? For example, the description of the ALT variant contains 15 bases, whilst the REF variant contains 13 bases, so that the actual INSERTION (AC) here is only 2 bases, but it has been rejected for being too long.

When bases are inserted into or deleted from a stretch of similar bases, as in the given example with a stretch of several 'AC', MutationTaster cannot know at which exact position the 'AC' was inserted (or deleted), because every position within the AC stretch yields the same result. That's why it has to use the whole 13 bases, although the actual insertion is only 2 bases. This is also the reason why such variants are described that way in your VCF file.
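
The ambiguity is easy to demonstrate. The sketch below uses an invented 13-base reference (one 'T' followed by six 'AC' repeats) and shows that inserting 'AC' at seven different positions produces the identical 15-base sequence; the helper names are hypothetical.

```python
def insert_at(sequence: str, position: int, inserted: str) -> str:
    """Insert bases before the given 0-based position."""
    return sequence[:position] + inserted + sequence[position:]


def positions_by_result(reference: str, inserted: str) -> dict:
    """Group all insert positions by the sequence they produce."""
    results = {}
    for position in range(len(reference) + 1):
        results.setdefault(insert_at(reference, position, inserted), []).append(position)
    return results


reference = "T" + "AC" * 6   # invented 13-base REF inside an AC repeat
alt = "T" + "AC" * 7         # 15-base ALT after inserting one 'AC'

positions = positions_by_result(reference, "AC")[alt]
print(len(reference), len(alt), positions)
```

Because all of these positions are indistinguishable, the variant can only be described unambiguously by giving the whole repeat stretch, and it is the length of that description that counts towards the limit.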

Why don't you offer a MutationTaster download version for local installation on my own machine?

We are regularly asked for standalone versions of MutationTaster, our conversion tools or the database. We don't offer these because it is not feasible. We would flood the world with lots of different versions of MutationTaster which we could never maintain. Distributing local installations would probably lead to hundreds of support questions, and we (only 2 people) are already busy with those that concern the version we control and know. We are not able to give support for installation issues, for questions like 'how is the conservation internally stored?' or for errors that occur only in versions modified by users.
Moreover, you would need very powerful hardware and a highly optimised server to reach the same speed as the online version. MutationTaster uses a database which is tens of GBs in size, containing parts of Ensembl and the 1000 Genomes data. Additionally, we use some external tools for which we have signed disclosure agreements and which we are hence not allowed to share with other groups anyway.
If you want to integrate MutationTaster into your own analysis pipeline for Next Generation Sequencing data, we suggest using our QueryEngine, which can be called via Perl's WWW::Mechanize module and similar approaches.

What does the AA changes score mean and how does it influence the prediction?

The score is taken from the Grantham Matrix for amino acid substitutions and reflects the physicochemical difference between the original and the mutated amino acid. It ranges from 0 to 215 but does not provide a value for amino acid insertions/deletions. However, the score is only displayed for information purposes and does not influence the prediction. Instead, MutationTaster uses the frequency of the respective AA exchange in known disease causing mutations and polymorphisms for the classification.

Why is the same variant classified as polymorphism when there is an amino acid exchange and as disease causing when there is no amino acid exchange?

MutationTaster uses three different models (without_aae, simple_aae, complex_aae) for its prediction. Depending on the type of variant, MutationTaster automatically determines the correct model. Each model was trained with a suitable set of known polymorphisms/disease mutations, and the weighting of the individual parameters differs among the models. Thus the prediction for a variant might not be the same if two different models are applied (e.g. the without_aae model and the simple_aae model). In some cases with a 'disease' prediction due to DNA-related features such as strong conservation, knowledge of the effect of the amino acid substitution can 'weaken' the prediction, e.g. if the difference between the two amino acids is modest and no protein domains are affected. This is a consequence of the different models: if we used only one, all 'silent' mutations would be considered polymorphisms - and we decided to rather risk false positives than to lose any true positives.

Why does the MutationTaster splice site prediction differ from the one obtained by NNsplice when entering the variant manually?

MutationTaster uses a locally installed version of NNSplice, which allows searching for non-canonical splice sites. This feature is not available in the online version of NNSplice. This might lead to differences between splice site predictions displayed by the online version of NNSplice and those displayed within the MutationTaster results. Apart from running NNSplice with different parameters, we also apply some post-filtering of the splice site predictions in order to remove results irrelevant for the MutationTaster Bayes classifier. In general, we only display splice site predictions which differ between wildtype and mutated sequence. Among other things, we do not display increases in the scores of predicted canonical splice sites which are indeed the "real", used splice sites (decreases in the score of the "real", used splice site are however displayed), and we do not display the decrease of the score or the complete loss of cryptic splice sites distant from intron/exon borders. Please see our documentation for more details on the splicing function in MutationTaster.
After the initial publication in 2010, we optimised our splice site predictions by applying a large test set of known splice sites and various settings for NNSplice. The setting we finally chose (threshold 0.3, non-canonical splice sites included) gave the best overall accuracy of splice site prediction (70%). However, splice site predictions should always be treated with caution because they are, unfortunately, not very accurate. We are currently evaluating more predictors and will hopefully be able to offer a much more precise model in the near future.

QueryEngine

Why are there so many cases with no prediction (n/a)?

Most of the n/a cases are due to a missing link between an Ensembl transcript and an NCBI gene (error message: no NCBI gene ID found for this transcript). Ensembl has far more genes and transcripts annotated than NCBI. However, we need to link the Ensembl genes to NCBI in order to get the HGNC gene symbol and SwissProt Accession ID. To circumvent this, we plan to fetch the SwissProt ID and gene symbol via Ensembl as well in the future, so that the analysis can be conducted nevertheless when the link to NCBI is missing.

Why are there so many variants outside genes, although I have uploaded a VCF file from exome sequencing?

Target enrichment is not 100% perfect, so it is normal that there are variants outside genes. Moreover, we do not use all available transcripts (see MutationTaster FAQs), because some are not suitable for analysis with MutationTaster. If a variant lies in a gene which only has transcripts not suitable for MutationTaster analysis, it will be counted as outside gene.