Yum, tasty mutations...

Mutation T@ster

Documentation


Input

Modifiable HTML elements are highlighted in blue.

Gene

You can identify your gene of interest by entering one of the following:

- HGNC symbol e.g. LEP (case insensitive)
- NCBI GeneID e.g. 3952
- Ensembl gene ID (starting with ENSG, e.g. ENSG00000174697)

MutationTaster automatically recognises the type of input. As soon as you click outside the input field, the available Ensembl transcripts for your gene are displayed.
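The three identifier types can be told apart by their shape alone. The following is a minimal sketch of such a recognition step, not MutationTaster's actual code; the function name and the returned labels are assumptions made for illustration.

```python
import re


def classify_gene_identifier(text: str) -> str:
    """Guess the identifier type from its shape (hypothetical helper)."""
    value = text.strip()
    if not value:
        raise ValueError("empty identifier")
    # Ensembl gene IDs start with ENSG followed by digits.
    if re.fullmatch(r"ENSG\d+", value, flags=re.IGNORECASE):
        return "ensembl_gene"
    # Ensembl transcript IDs start with ENST followed by digits.
    if re.fullmatch(r"ENST\d+", value, flags=re.IGNORECASE):
        return "ensembl_transcript"
    # NCBI GeneIDs are plain integers.
    if re.fullmatch(r"\d+", value):
        return "ncbi_gene"
    # Everything else is treated as an HGNC symbol (case insensitive).
    return "hgnc_symbol"


for example in ("LEP", "3952", "ENSG00000174697", "ENST00000308868"):
    print(example, "->", classify_gene_identifier(example))
```

Anything that is neither numeric nor an Ensembl ID falls through to the HGNC branch, so a misspelled symbol would only be caught later, when no transcripts are found.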

Transcript

You can also directly enter the Ensembl transcript id (starting with ENST, e.g. ENST00000308868) of your gene of interest.
In this case, you do not need to fill out the gene identifier field.

At present, it is *absolutely necessary* that you indicate an Ensembl transcript ID, either by choosing one from the radio button menu or by typing it directly into the appropriate field. We plan to remove this requirement in the future, but right now it is still essential. If more than one transcript is available, they are ordered by length from top to bottom.

Position / snippet refers to

Choose coding sequence if you are working with coding sequence positions / sequence for localising the alteration of interest. Coding sequence (CDS) position 1 refers to the A of the start ATG (the CDS is sometimes also called ORF, for open reading frame).
Choose transcript (cDNA sequence) if you are working with cDNA positions / sequence for localising the alteration of interest. cDNA position 1 refers to the first base of the transcript.
Choose gene (genomic sequence) if you are working with gDNA positions / sequence for localising the alteration of interest. Genomic sequence (gDNA) position 1 refers to the first base of the gene.
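The CDS and cDNA modes differ only by the offset of the start ATG within the transcript. A small sketch of the conversion, assuming the cDNA position of the A in the start ATG is known (MutationTaster reports it in the output as 'position of start ATG in wt / mu cDNA'); the function names are hypothetical, and gDNA positions are not covered because they additionally depend on the exon structure.

```python
def cds_to_cdna(cds_pos: int, start_atg_cdna_pos: int) -> int:
    """Convert a 1-based CDS position to a 1-based cDNA position."""
    if cds_pos < 1:
        raise ValueError("CDS positions start at 1 (the A of the start ATG)")
    return start_atg_cdna_pos + cds_pos - 1


def cdna_to_cds(cdna_pos: int, start_atg_cdna_pos: int) -> int:
    """Convert a 1-based cDNA position to a 1-based CDS position."""
    cds_pos = cdna_pos - start_atg_cdna_pos + 1
    if cds_pos < 1:
        raise ValueError("position lies in the 5'UTR and has no CDS position")
    return cds_pos


# Example with an assumed start ATG at cDNA position 58.
print(cds_to_cdna(1204, 58))
print(cdna_to_cds(1261, 58))
```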

Alteration, all types by sequence

Choose all types by sequence if you have a sequence snippet around an alteration that you want to analyse. You can paste this sequence snippet into this field, putting square brackets [ ] around the altered base and the new base (e.g. ACGGTT[A/G]CTCTAAGGA for a base exchange from A to G). Comprehensive examples of the format are provided directly on the input mask. Additionally, you have to (1) indicate the HGNC symbol of the gene in question and (2) the transcript ID (or select one after entering a gene). All entries have to refer to the 5'-3' direction of the transcript sequence.
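The bracket notation can be checked and split with a single regular expression. This is an illustrative sketch only; the names are hypothetical, and allowing an empty side of the slash for insertions or deletions is an assumption (the authoritative format examples are those on the input mask).

```python
import re
from typing import NamedTuple


class Snippet(NamedTuple):
    upstream: str
    old_bases: str
    new_bases: str
    downstream: str


# FLANK[OLD/NEW]FLANK; OLD or NEW may be empty (assumption, see lead-in).
_SNIPPET_RE = re.compile(r"([ACGT]*)\[([ACGT]*)/([ACGT]*)\]([ACGT]*)")


def parse_snippet(text: str) -> Snippet:
    """Split a snippet such as ACGGTT[A/G]CTCTAAGGA into its four parts."""
    match = _SNIPPET_RE.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError("Snippet not properly formatted.")
    return Snippet(*match.groups())


def wild_type(snippet: Snippet) -> str:
    return snippet.upstream + snippet.old_bases + snippet.downstream


def mutated(snippet: Snippet) -> str:
    return snippet.upstream + snippet.new_bases + snippet.downstream


parsed = parse_snippet("ACGGTT[A/G]CTCTAAGGA")
print(wild_type(parsed))
print(mutated(parsed))
```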

Alteration, single base exchange by position

Choose single base exchange by position if you are working with a single base exchange, i.e. an alteration in which exactly one base is changed. If the mutation is named according to the HGVS variation nomenclature, the name indicates whether you have to work in coding sequence (CDS) or gDNA mode.

Enter the position of the base exchange. Important: in coding sequence mode, position 1 refers to the A in the ATG start codon; in transcript (cDNA sequence) mode, position 1 refers to the first base in the cDNA transcript, which is mainly part of the 5'UTR. Positions must not exceed the length of the sequence.

After you change the content of the input field and click outside it, the sequence snippet surrounding the indicated (exchanged) base of interest appears at the bottom of the screen. The wild-type base affected by the base exchange is highlighted in blue. Please check whether the highlighted base is the one you wanted to indicate and whether the surrounding sequence is correct.

Then fill in the new base. For a base exchange c.1204G>T you would enter a T as new base(s).

Alteration, insertion or deletion by position

Choose insertion / deletion by position if you are working with an insertion, a deletion or a combination thereof. You do not need to further specify which kind of alteration you are exactly dealing with, since this is automatically determined by the software (and displayed in the output).

Enter the region of the alteration in the order the input fields are arranged on the screen:
...last wild type base before alteration refers to the base directly preceding the alteration. For a deletion of three nucleotides, e.g. c.92_94delGAC (or c.92_94del3 or c.92_94del), enter 91.
...first wild type base after alteration refers to the base directly following the alteration. For a deletion of three nucleotides, e.g. c.92_94delGAC (or c.92_94del3 or c.92_94del), enter 95.

Position 1 refers to the first base of the ORF / cDNA / gDNA, depending on the chosen mode. Positions must not exceed the length of the gene's ORF / cDNA / gDNA sequence.

After you change the content of the ...first wild type base after alteration field and click outside it, the sequence snippet surrounding the indicated altered region appears at the bottom of the screen. The wild-type base(s) affected by the alteration are highlighted in blue. Please check whether the highlighted base(s) are the one(s) you wanted to indicate and whether the surrounding sequence is correct.

Enter the new bases. For example in case of an insertion of a GAGA-sequence between nucleotides 51 and 52 of the coding region (c.51_52insGAGA), enter GAGA. For a deletion of three nucleotides like c.92_94delGAC (or c.92_94del3 or c.92_94del) simply enter nothing.
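The three input fields can be derived mechanically from simple HGVS names. The sketch below covers only the two range forms used as examples above (c.N_Mdel... and c.N_MinsBASES); the function name is hypothetical and other HGVS forms are deliberately rejected.

```python
import re
from typing import Tuple


def indel_fields(hgvs: str) -> Tuple[int, int, str]:
    """Return (last wt base before, first wt base after, new bases)."""
    # Deletion of a range, e.g. c.92_94delGAC, c.92_94del3 or c.92_94del.
    match = re.fullmatch(r"c\.(\d+)_(\d+)del(?:[ACGT]+|\d+)?", hgvs)
    if match:
        first_deleted, last_deleted = int(match.group(1)), int(match.group(2))
        return first_deleted - 1, last_deleted + 1, ""
    # Insertion between two neighbouring bases, e.g. c.51_52insGAGA.
    match = re.fullmatch(r"c\.(\d+)_(\d+)ins([ACGT]+)", hgvs)
    if match:
        left, right = int(match.group(1)), int(match.group(2))
        if right != left + 1:
            raise ValueError("insertion must lie between adjacent bases")
        return left, right, match.group(3)
    raise ValueError("HGVS form not covered by this sketch: " + hgvs)


print(indel_fields("c.92_94delGAC"))
print(indel_fields("c.51_52insGAGA"))
```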

options

Check 'show nucleotide alignment' if you want to see multi-species alignment of nucleotide sequence around the submitted alteration in the results. By default, nucleotide alignment is not run, since the BLAST call slows down MutationTaster and the results are not used by the Bayes Classifier anyway.

Name of alteration

You can enter a self-chosen name for the alteration in question. This will be displayed in the output in order to facilitate the identification of printed outputs for different mutations in the same gene.

 

Bayes classifier

MutationTaster employs a Bayes classifier to eventually predict the disease potential of an alteration. The Bayes classifier is fed with the outcome of all tests and the features of the alterations and calculates probabilities for the alteration to be either a disease mutation or a harmless polymorphism. For this prediction, the frequencies of all single features for known disease mutations/polymorphisms were studied in a large training set composed of >390,000 known disease mutations from HGMD Professional and >6,800,000 harmless SNPs and Indel polymorphisms from the 1000 Genomes Project (TGP).
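To illustrate the principle, here is a toy naive Bayes calculation. It is a sketch under strong assumptions: the feature names, the frequencies, and the equal prior are invented for the example and are not MutationTaster's training data or its actual feature set.

```python
import math
from typing import Dict, Hashable

FrequencyTable = Dict[str, Dict[Hashable, float]]


def disease_probability(features: Dict[str, Hashable],
                        freq_disease: FrequencyTable,
                        freq_polymorphism: FrequencyTable,
                        prior_disease: float = 0.5) -> float:
    """Posterior probability of 'disease mutation' given independent features."""
    log_d = math.log(prior_disease)
    log_p = math.log(1.0 - prior_disease)
    for name, value in features.items():
        # Multiply the per-feature frequencies (sum of logs for stability).
        log_d += math.log(freq_disease[name][value])
        log_p += math.log(freq_polymorphism[name][value])
    # Normalise so that both class probabilities sum to 1.
    top = max(log_d, log_p)
    d, p = math.exp(log_d - top), math.exp(log_p - top)
    return d / (d + p)


# Invented frequencies, for illustration only.
freq_disease = {"conserved": {True: 0.8, False: 0.2},
                "splice_change": {True: 0.3, False: 0.7}}
freq_polymorphism = {"conserved": {True: 0.2, False: 0.8},
                     "splice_change": {True: 0.1, False: 0.9}}

prob = disease_probability({"conserved": True, "splice_change": False},
                           freq_disease, freq_polymorphism)
print(round(prob, 3))
```

The probability reported for a 'polymorphism' prediction would simply be one minus this value.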

Models

We provide three different models for different types of alterations: one for 'silent' (synonymous or intronic) alterations (without_aae model), one for alterations leading to the substitution/insertion/deletion of a single amino acid (simple_aae model), and one for more complex changes of the amino acid sequence, e.g. mutations introducing a premature stop codon (complex_aae model). All models were trained with all available and suitable common polymorphisms and disease mutations. MutationTaster automatically determines the correct model for each alteration.

Output: probability value

The probability value is the probability of the prediction, i.e. a value close to 1 indicates high confidence in the prediction. Please note that the p value used here is NOT the probability of error as used in t-test statistics.
Our results show that wrong predictions are usually not reflected by low probability values but are rather caused by polymorphisms or disease causing alterations that show characteristics of the other case, e.g. SNPs that are highly conserved and destroy protein features or disease mutations that appear to have no effect on the protein/gene at all.

If an alteration is a 'true' SNP (as confirmed by the existence of each of the three genotypes AA, AB, BB in the HapMap data or by presence in TGP in homozygous state in > 4 cases), it is automatically predicted to be a polymorphism. Alterations causing a premature termination codon and ultimately leading to nonsense-mediated mRNA decay (NMD) are automatically assigned the 'disease causing' status. In both cases, the Bayes classifier is run nevertheless and the probability for the prediction that was automatically made is shown. Scores below 0.5 hence indicate that our classifier comes to a different conclusion. A few SNPs listed in HapMap introduce premature stop codons and will cause NMD; these are likely to be mistaken for disease mutations.
We advise you not to exclude an alteration due to a dbSNP ID. Many SNPs from dbSNP are not validated and some are even known to be disease causing variants (e.g. rs28939070 is responsible for Trichorhinophalangeal Syndrome, type I).
Since we used 'true' SNPs from the 1000 Genomes Project as our polymorphism data set, we included neither genotype data from the 1000 Genomes Project nor HapMap frequencies in the training and optimisation of MutationTaster or in the comparison with other applications.

The Bayes classifier is regularly updated, i.e. predictions might in some cases change over time.

Statistics

Click here for detailed information about and results of the cross-validation and the comparison of MutationTaster with similar tools.

Output

The different elements of the output are named and described below.

summary

List of the most prominent features of the analysed alteration (e.g. 'at intron-exon boundary', 'spans start ATG', 'homozygous in TGP' etc.)

name of alteration

A user-specified name in order to identify printed outputs.

alteration (phys. location)

The alteration on "physical" i.e. chromosomal level (e.g. chr7:91623937_91623938insGGCAAT).

HGNC symbol

The official HGNC symbol.

Ensembl transcript ID

Ensembl [1] transcript ID, starting with ENST.

UniProt peptide (SwissProt ID)

UniProt KB / SwissProt [2] accession ID. Unfortunately, this does not always correctly correspond to the selected product of the transcript.

alteration type

Is either a base exchange, a combination of insertion and deletion, an insertion or a deletion.

alteration region

Is either 5'UTR (untranslated region), CDS (coding sequence), 3'UTR or intron.

DNA changes

Alteration on nucleotide level. gDNA level (g.) is displayed always, cDNA level (cDNA.) for alterations located in exons, CDS level (c.) only for alterations residing in an exon in the coding sequence.

AA changes

Any amino acid changes are shown here, displaying the original versus the new amino acid as well as the position of the substitution and a score for it. This score is taken from an amino acid substitution matrix (Grantham Matrix [3]) which takes into account the physico-chemical characteristics of amino acids and scores substitutions according to the degree of difference between the original and the new amino acid. Scores may range from 0.0 to 215. Since the Grantham matrix does not provide values for an amino acid insertion/deletion, no score is given in such cases. The score is displayed for information only and does not influence the MutationTaster prediction as generated by our Bayes classifier. An asterisk (*) stands for a stop codon; a minus (-) means that there was no AA at this position in the original AA sequence. If the initial methionine codon (start ATG) is lost, MutationTaster searches for a potential new, downstream start ATG and informs you about AA changes based on the assumed alternative AA sequence.

position(s) of altered AA

Lists the positions of altered AA. For mutations resulting in a frameshift, the position of the first altered AA is displayed along with the information that due to a frameshift, there are further changes downstream.

frameshift

Can be either yes or no.

dbSNP / TGP / ClinVar / HGMD

Any known polymorphism(s) or known disease variant(s) that have been found at the position in question. Our database contains all single nucleotide polymorphisms (SNPs) from the NCBI SNP database (dbSNP). Moreover, we have stored all HapMap genotype frequencies as well as variants from the 1000 Genomes Project [4] (abbreviated here as TGP). If an alteration is located at the same position as a known dbSNP entry, MutationTaster provides the SNP ID (or rs ID) and a link together with the HapMap genotype frequencies, if available. If each of the three possible genotypes is observed in at least one HapMap population, the alteration is automatically regarded as a polymorphism (the naive Bayes classifier is run nevertheless and the p value for the prediction is shown). Please note that there may be differences between your alteration and the alleles in dbSNP. For the 1000 Genomes Project, MutationTaster provides information in one of the following formats:

- > 4 cases homozygous in TGP: TGP: allele_alt/allele_alt found more than 4 times in TGP data: #homozygous_hits
- > 4 cases heterozygous in TGP: TGP: allele_ref/allele_alt found more than 4 times in TGP data: #heterozygous_hits (#homozygous_hits for allele_alt/allele_alt)
- < 4 cases homo-/heterozygous in TGP: TGP: allele_ref/allele_alt found #heterozygous_hits times in TGP data, allele_alt/allele_alt #homozygous_hits times.

If an alteration was found more than 4 times homozygously in TGP, it is automatically regarded as a polymorphism.
We also display known disease variants from dbSNP ClinVar. If a variant is marked as probable-pathogenic or pathogenic in ClinVar, it is automatically predicted to be disease-causing (the naive Bayes classifier is run nevertheless and the p value for the prediction is shown).
Moreover, we have integrated the public version of the Human Gene Mutation Database (HGMD) [5]. The data includes the positions of the disease mutations and their HGMD ID. The disease alleles are not included so we cannot use HGMD for automatic predictions. Whenever an HGMD public disease mutation is found at the same position as a variant, this will be written in the summary. We also place a direct hyperlink to the mutation in HGMD into the 'dbSNP / TGP / HGMD(public) / ClinVar' field, so you can check whether the HGMD mutation has the same allele as your variant (and whether the disease matches). Please note that you must be logged in at the HGMD site to make the hyperlink work - access to the public version is free but requires registration.
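The automatic assignments described in this documentation (HapMap genotypes, TGP homozygous counts, ClinVar status, and NMD) can be summarised as a set of independent rules. The sketch below only lists which rules fire; the function name, argument names, and rule labels are hypothetical, and the documentation does not state how conflicts between rules are resolved, so no precedence is implemented.

```python
from typing import List, Optional, Set, Tuple


def automatic_rules(hapmap_genotypes: Set[str],
                    tgp_homozygous_hits: int,
                    clinvar_status: Optional[str] = None,
                    causes_nmd: bool = False) -> List[Tuple[str, str]]:
    """Return (rule, automatic prediction) pairs that apply to a variant."""
    triggered = []
    # All three genotypes observed in HapMap: regarded as a polymorphism.
    if {"AA", "AB", "BB"}.issubset(hapmap_genotypes):
        triggered.append(("hapmap_all_genotypes", "polymorphism"))
    # Found homozygously more than 4 times in TGP: polymorphism.
    if tgp_homozygous_hits > 4:
        triggered.append(("tgp_homozygous", "polymorphism"))
    # Marked pathogenic or probable-pathogenic in ClinVar: disease causing.
    if clinvar_status in ("pathogenic", "probable-pathogenic"):
        triggered.append(("clinvar", "disease causing"))
    # Premature termination codon leading to NMD: disease causing.
    if causes_nmd:
        triggered.append(("nmd", "disease causing"))
    return triggered


print(automatic_rules({"AA", "AB", "BB"}, 0))
print(automatic_rules(set(), 4))
```

HGMD public entries are intentionally absent from the rules, since the disease alleles are not available and HGMD is therefore not used for automatic predictions.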

regulatory features

Our database contains so-called regulatory features from the Ensembl Regulation database, such as histone modification sites, open chromatin or transcription factor binding sites. For more information about Ensembl Regulation, please see their documentation. Since it is not yet clear if and how the regulatory features influence the gene under scrutiny or rather up- / downstream genes, the regulatory features are not used by the Bayes classifier for prediction, but only displayed for informational reasons here.

phyloP / phastCons

phastCons and phyloP are both methods to determine the grade of conservation of a given nucleotide [6]. MutationTaster uses values which are precomputed and offered by UCSC. phastCons values vary between 0 and 1 and reflect the probability that each nucleotide belongs to a conserved element, based on the multiple alignment of genome sequences of 46 different species (the closer the value is to 1, the more probable the nucleotide is conserved). It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP (values between -14 and +6) separately measures conservation at individual columns, ignoring the effects of their neighbors. Moreover, phyloP can not only measure conservation (slower evolution than expected under neutral drift) but also acceleration (faster than expected). Sites predicted to be conserved are assigned positive scores, while sites predicted to be fast-evolving are assigned negative scores. For more information about phyloP and phastCons, please see the cited paper or the description on the UCSC website.
See how well both scores perform on real data and how these scores are used by MutationTaster2.

splice sites

MutationTaster uses a locally installed third party splice site prediction program, namely NNSplice [7] from the Berkeley Drosophila Genome Project (a web-based version is available at http://fruitfly.org/seq_tools/splice.html) to analyse possible changes in splice sites.
To this end, a gDNA sequence snippet of around 60 bases (alteration plus 30 bases up- and downstream) in its mutated and wild-type form is created and submitted to NNSplice. If there are any changes in the mutated sequence (existing splice site got stronger i.e. increased or weaker i.e. decreased, additional splice site activated i.e. gained or splice site completely lost), MutationTaster determines the position of this splice site change relative to intron/exon borders: if a loss/decrease of a splice site occurs at an intron/exon border or exon/intron border, this will be taken for a "real" splice site change. Loss/decrease of splice sites distant from intron/exon (and reverse) borders will be ignored. A gain of a completely new splice site is displayed, if the confidence score of the newly created splice site is greater than 0.3. An increase in an already existing splice site will be displayed if the change in the confidence score is greater than 10%. All (not ignored) changes are displayed with the effect, genomic position of the splice site, the prediction score for wild-type (wt) and/or mutated (mu) splice site as generated by NNSplice, the (wildtype) detection sequence itself and very short sequence information about the splice site, with a pipe (|) indicating the border between intron and exon.
Splice site analysis is turned off by default for mitochondrial genes.
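The display rules for splice site changes can be written down as a small decision function. This is a sketch of the rules as described in this section, not the actual implementation; the function name is hypothetical, a missing site is represented by None, and reading "greater than 10%" as a relative change of the wild-type score is an assumption.

```python
from typing import Optional


def displayed_splice_effect(wt_score: Optional[float],
                            mu_score: Optional[float],
                            at_exon_border: bool) -> Optional[str]:
    """Return the effect to display, or None if the change is ignored."""
    if wt_score is None and mu_score is None:
        return None
    if wt_score is None:
        # Completely new site: shown only above a confidence score of 0.3.
        return "gained" if mu_score > 0.3 else None
    if mu_score is None:
        # Loss counts only at an intron/exon or exon/intron border.
        return "lost" if at_exon_border else None
    if mu_score < wt_score:
        # Decrease counts only at a border as well.
        return "decreased" if at_exon_border else None
    if mu_score - wt_score > 0.10 * wt_score:
        # Increase of an existing site by more than 10% (relative, assumed).
        return "increased"
    return None


print(displayed_splice_effect(None, 0.45, False))
print(displayed_splice_effect(0.9, None, False))
print(displayed_splice_effect(0.5, 0.6, True))
```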

Kozak consensus sequence altered

The Kozak consensus sequence (gccRccAUGG; R = purine) starts upstream of the start codon (AUG) and plays a major role in the initiation of translation. The purine (R) at position -3 as well as the G in position +4 are highly conserved. The program checks whether for a given alteration a previously strong consensus sequence has been weakened.
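A check of the two highly conserved positions might look as follows. This is a minimal sketch: the three strength labels are a common convention rather than something defined in this documentation, the function names are hypothetical, and the start ATG is assumed to be intact in both sequences.

```python
_RANK = {"weak": 0, "adequate": 1, "strong": 2}


def kozak_strength(cdna: str, atg_index: int) -> str:
    """Rate the Kozak context; atg_index is the 0-based index of the A in ATG."""
    seq = cdna.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # Position -3 is three bases upstream of the A (there is no position 0).
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    # Position +4 is the base directly following the ATG.
    plus4 = seq[atg_index + 3:atg_index + 4]
    hits = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return {2: "strong", 1: "adequate", 0: "weak"}[hits]


def kozak_weakened(wt_cdna: str, mu_cdna: str, atg_index: int) -> bool:
    """True if the mutated context is weaker than the wild-type context."""
    return _RANK[kozak_strength(mu_cdna, atg_index)] < _RANK[kozak_strength(wt_cdna, atg_index)]


# Wild type gccAccATGG versus a purine-to-T change at position -3.
print(kozak_strength("GCCACCATGGCG", 6))
print(kozak_weakened("GCCACCATGGCG", "GCCTCCATGGCG", 6))
```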

conservation on AA level

For conservation analysis, amino acid or nucleotide sequence homologues of ten other species (chimp, rhesus macaque, mouse, cat, chicken, claw frog, pufferfish, zebrafish, fruitfly, and worm) are aligned with the corresponding human sequence of the gene in question. Sequences are aligned with blastp [8], which is installed as stand-alone executable on our server, and analysed by MutationTaster.
The status of evolutionary conservation is classified as either all identical (i.e. the same amino acid in the human and the homologous amino acid sequence), (partly) conserved (i.e. similar amino acids in the human and the homologous amino acid sequence) or not conserved (i.e. different amino acids in the human and the homologous amino acid sequence). The status for local nucleotide sequence alignments is either conserved or not conserved. Additionally, MutationTaster states when no homologous gene is known or no alignment could be made. Alignments are shown as snippets for each species, including the position of the analysed residue, the alignment and the status. We deliberately restrict conservation analysis to ten animal species, although sequence data for far more species is available. The inclusion of further species did not have considerable influence on prediction accuracy, but each additional alignment significantly decreased the speed of MutationTaster.

protein features

The program checks whether any protein features are directly or indirectly affected by the alteration. Our database stores all human SwissProt protein features. Some features will not have an influence on the prediction; they are only displayed for information and should not have an impact on the disease-causing potential of the alteration (e.g. CONFLICT or MUTAGEN).
Lost means that the AA exchange invoked by the alteration in question is located within the protein feature. A protein feature might get lost if a whole exon is skipped due to splice site changes, or if a protein is shortened because of a premature termination codon - in those cases, protein features are indirectly affected.

length of protein

MutationTaster checks if the resulting protein will be elongated (prolonged), truncated, or whether nonsense-mediated mRNA decay (NMD) is likely to occur. MutationTaster determines the NMD border as last intron/exon junction minus 50 bp and analyses if a given premature termination codon occurs 5' to this border, thus leading to NMD. An elongated protein is referred to as prolonged, i.e. the original termination codon is destroyed and translation stops later than normal. A truncated protein is referred to as either slightly truncated (if less than 10% of the wild-type protein length is missing) or strongly truncated (if more than 10% of the original protein length is missing). In the two latter cases, the additional information 'might cause NMD' is given, because the '-55 boundary rule' is not fulfilled, but it cannot be ruled out that NMD occurs nevertheless. If MutationTaster concludes that an alteration causes NMD, this alteration is automatically regarded as a disease mutation. The classifier is run nevertheless and the p value for the prediction is shown.
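The classification above can be sketched as a short function. The names, the 'normal' label for an unchanged length, and the handling of a loss of exactly 10% (counted here as strongly truncated) are assumptions; all positions are taken to be CDS coordinates.

```python
def protein_length_effect(wt_aa_len: int,
                          mu_aa_len: int,
                          stop_pos_cds: int,
                          last_junction_cds: int) -> str:
    """Classify the effect of an alteration on protein length."""
    if mu_aa_len > wt_aa_len:
        # Original termination codon destroyed, translation stops later.
        return "prolonged"
    if mu_aa_len == wt_aa_len:
        return "normal"
    # NMD border: last intron/exon junction minus 50 bp.
    nmd_border = last_junction_cds - 50
    if stop_pos_cds < nmd_border:
        # Premature termination codon lies 5' of the border.
        return "NMD"
    missing_fraction = (wt_aa_len - mu_aa_len) / wt_aa_len
    if missing_fraction < 0.10:
        return "slightly truncated (might cause NMD)"
    return "strongly truncated (might cause NMD)"


print(protein_length_effect(300, 100, 303, 800))
print(protein_length_effect(300, 280, 843, 800))
```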

AA sequence altered

Can be either yes (AA exchange) or no (no AA exchange)

position(s) of altered AA

If the alteration in question is located in the CDS, the position on amino acid level is shown here. If the alteration spans two or more amino acids, these are all displayed and separated by a comma.

position of stopcodon in wt / mu CDS

Position of the last base of the stop codon (this can either be TGA, TAA or TAG), position 1 refers to the A in the start ATG codon.

position (AA) of stopcodon in wt / mu AA sequence

Position of the stop asterisk (*) in the amino acid sequence, position 1 refers to the first amino acid of the protein.

poly(A) signal

MutationTaster uses a locally installed version of the program polyadq [9] for analysis of polyadenylation signals. More information at http://rulai.cshl.org/tools/polyadq/polyadq_form.html

conservation on nucleotide level

Conservation on nucleotide level is analysed similarly to AA level: using bl2seq, homologous DNA sequences of different species are compared to the human DNA sequence. Conservation status can be all identical (same base(s) in human and species sequence), not conserved (different base(s) in human and species sequence) or no alignment (if no local alignment around the indicated position(s) was found). If no homologous sequences are found, this is indicated by no homologue. Up to now, conservation on nucleotide level is not used for the prediction.

position of start ATG in wt / mu cDNA

Position of the A in the start ATG, position 1 refers to the first base of the cDNA. If the regular start ATG is changed by an alteration, MutationTaster searches for the next most 5'-ATG and assumes this to be the new start ATG for the mutated sequence.
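The search for a replacement start codon amounts to finding the first ATG in the mutated cDNA. A minimal sketch with a hypothetical function name; it returns 1-based positions as used throughout this documentation and, as an assumption, applies no further criteria such as reading frame or Kozak context.

```python
from typing import Optional


def find_start_atg(cdna: str, search_from: int = 1) -> Optional[int]:
    """Return the 1-based position of the first ATG at or after search_from."""
    index = cdna.upper().find("ATG", search_from - 1)
    return index + 1 if index != -1 else None


wild_type_cdna = "CCATGAAATGCC"
mutated_cdna = "CCACGAAATGCC"  # T>C destroys the regular start ATG

print(find_start_atg(wild_type_cdna))
print(find_start_atg(mutated_cdna))
```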

position of termination codon in wt / mu cDNA

Position of the last base pair of the termination codon (this can be either TGA, TAA or TAG), position 1 refers to the first base pair of the cDNA.

chromosome

The chromosome the alteration is located on.

strand

Is either 1 for forward strand or -1 for reverse strand.

last intron/exon border

The last base of the exon before the last exon.

theoretical NMD border in CDS

In order to avoid truncated proteins which might act in a dominant-negative manner, the eukaryotic cell has a surveillance mechanism to ensure that only error-free mRNAs are translated. It was shown that mRNA shorter than a given length is nearly completely degraded. This process is known as nonsense-mediated mRNA decay or NMD. The rule seems to be that a termination codon occurring 50-55 nucleotides upstream of the final intron / exon junction initiates the NMD machinery and the mRNA gets degraded. Therefore, this program determines the NMD border as last intron / exon junction minus 50 bp and analyses if a given premature termination codon occurs 5' to this border thus eventually leading to NMD.

length of CDS

The length of the coding sequence from the A of the initiation codon (ATG) to the last base of the termination codon.

cDNA position

Gives the last wild-type base before alteration and first wild-type base after alteration in coding DNA sequence context (positions relative to start of transcribed coding DNA reference sequence) e.g. 1203 / 1205, the altered base is at position 1204.

gDNA position

Gives the last wild-type base pair before alteration and first wild-type base pair after alteration in genomic DNA sequence context (positions relative to start of genomic DNA reference sequence) e.g. 53,344 / 53,346, the altered base is at position 53,345.

chromosomal position

Gives the last wild-type base before alteration and first wild-type base after alteration in chromosomal sequence context (position relative to start of chromosomal reference sequence) e.g. 154,372,337 / 154,372,339, the altered base is at position 154,372,338.

gDNA and cDNA sequence snippet

The sequence surrounding the alteration (20 bp up- and downstream). The altered bases are highlighted in blue.

wild-type and mutated AA sequence

Complete AA sequences, the asterisk (*) indicates STOP.

speed

This is the time MutationTaster needed for analysis & prediction - your browser might need some extra time to display the results, especially if you include images.

error messages

InsDel too long

At present, MutationTaster handles only InsDels up to 12 bases.

Your mutation of interest seems to span an exon/intron boundary.

This kind of mutation can only be analysed in gDNA mode.

No transcripts for this gene found!

You might have mis-spelled the gene symbol or used a protein name which is not always also the correct symbol (e.g. protein p53 is gene TP53). Also, in some (rare) cases a NCBI gene could not be mapped to an Ensembl gene. As some external data is based on NCBI while other is based on Ensembl, MT needs both to make a prediction. Moreover, we filter out protein-coding transcripts (Ensembl biotype protein_coding) without a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). This might lead to the phenomenon that MutationTaster complains about "no suitable transcripts" or "no transcripts for this gene found" although Ensembl lists one or several. Transcripts of mitochondrial genes are not tested for integrity due to differences in the mitochondrial genetic code.

No internal Ensembl transcript ID found. / No Ensembl gene ID found for transcript. / No stable ID for this gene.

Our database doesn't know the transcript you specified. This might happen if you refer to a newer or older release than the one we use. The release MT uses is mentioned on the query interface.

Ensembl gene XXX not found in ENSEMBL

Our database doesn't know the gene you specified. This might happen if you refer to a newer or older release than the one we use. The release MT uses is mentioned on the query interface.

No NCBI gene ID found. / No NCBI gene ID found for this transcript.

In some (rare) cases an Ensembl gene could not be mapped to a NCBI gene. As some external data is based on NCBI while other is based on Ensembl, MT needs both to make a prediction.

Too many NCBI gene IDs found.

In some (rare) cases an Ensembl gene could not be mapped to a single NCBI gene. As some external data is based on NCBI while other is based on Ensembl, MT needs both to make a prediction.

Only invalid NCBI gene IDs found.

In some (very rare) cases an Ensembl gene could not be mapped to a valid NCBI gene, i.e. the NCBI gene Ensembl refers to is 'discontinued' and was replaced by another gene. As some external data is based on NCBI while other is based on Ensembl, MT needs both to make a prediction. Please contact us if you encounter such a case.

Gene XXX not found on any chromosome.

The gene under scrutiny has no valid positional data. This should not occur at all. Please contact us if you encounter such a case.

Gene XXX (Entrez gene YYY) and transcript ZZZ do not match!

The transcript you entered is not a product of the gene you entered. Please check your input.

Position is out of gene!

You entered a position that is located outside the gene. This may happen if you mapped a genomic position to a gene-specific position using an old genome build or, of course, because of a typo. Please check your input.

Could not retrieve a sequence or sequence is too short.

MT was not able to get the gene sequence from Ensembl. This might be due to network problems so you should repeat the analysis after some time. Should this not work, please contact us.

No start ATG exon found.

The transcript is not properly annotated: there is no start position of the coding sequence in the database. Please select another transcript of the same gene.

No stop exon found.

The transcript is not properly annotated: there is no stop position of the coding sequence in the database. Please select another transcript of the same gene.

Chosen transcript ENSTXXX has no correct start ATG annotated.

Protein-coding transcripts (Ensembl biotype protein_coding) are tested for transcript integrity, i.e. for presence of a correct start codon (ATG) and correct stop codon (TGA, TAA, TAG). If one is missing, an error message is shown because analysis in corrupt transcripts might lead to a wrong prediction.
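The integrity test can be illustrated on a plain CDS string. This is a sketch with hypothetical names; it checks only the two conditions named above and, like MutationTaster, would not be applicable to mitochondrial transcripts, which use a different genetic code.

```python
from typing import List

STOP_CODONS = {"TGA", "TAA", "TAG"}


def transcript_integrity_problems(cds: str) -> List[str]:
    """Return a list of integrity problems; an empty list means the CDS passes."""
    seq = cds.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("no correct start ATG")
    if seq[-3:] not in STOP_CODONS:
        problems.append("no correct stop codon")
    return problems


print(transcript_integrity_problems("ATGAAATGA"))
print(transcript_integrity_problems("CTGAAATGA"))
```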

Sequence XXX is not unique in your gene!

Please use a longer snippet.

Sequence was not found in your gene.

Please check your input: is there a typo in your snippet? Or do you use a snippet created from the wrong strand? MT always refers to the strand the gene is located on.

Snippet not properly formatted.

Please check your input: snippets must be specified as ACGTACGT[OLDBASES/NEWBASES]ACGTACGT.

Comparison with other prediction tools

2013: SIFT, PROVEAN, PolyPhen-2

Known bugs and limitations

Future plans

Contact

In case you discover bugs, have suggestions or questions, please write an e-mail to
Jana Marie Schwarz (jana-marie.schwarz AT charite.de) or to
Dominik Seelow
(dominik.seelow AT charite.de).
We also appreciate hearing about your general experiences using MutationTaster.

References

[1] Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sobral D, Tang YA, Taylor K, Trevanion S, Vandrovcova J, White S, Wilson M, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Durbin R, Fernández-Suarez XM, Harrow J, Herrero J, Hubbard TJ, Parker A, Proctor G, Spudich G, Vogel J, Yates A, Zadissa A, Searle SM: Ensembl 2012. Nucleic Acids Res. 2012 Jan;40(Database issue):D84-90. doi: 10.1093/nar/gkr991.

[2] Magrane M, Consortium U: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford). 2011 Mar 29;2011:bar009. doi: 10.1093/database/bar009.

[3] Grantham R: Amino acid difference formula to help explain protein evolution. Science 185: 862-864 (1974)

[4] 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012 Nov 1;491(7422):56-65

[5] Stenson PD, Mort M, Ball EV, Howells K, Phillips AD, Thomas NS, Cooper DN: The Human Gene Mutation Database: 2008 update. Genome Med. 2009 Jan 22;1(1):13. doi: 10.1186/gm13.

[6] Pollard KS, Hubisz MJ, Siepel A: Detection of non-neutral substitution rates on mammalian phylogenies. Genome Res. 2010 Jan;20(1):110-21

[7] Berkeley Drosophila Genome Project; Reese MG, Eeckman FH, Kulp D, Haussler D: Improved Splice Site Detection in Genie. J Comp Biol 1997 4;(3), 311-23.

[8] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.

[9] Tabaska JE, Zhang MQ: Detection of polyadenylation signals in human DNA sequences. Gene 1999;231: 77-86.