Yum, tasty mutations...

Mutation T@ster

Next Generation Sequencing pipeline

Requirements

Application of the Next Generation Sequencing pipeline requires some basic knowledge of Unix. At least, the user needs to know standard shell commands and should be able to edit some configuration files.

The mapping tool requires BLAT (the blast-like alignment tool, Kent 2002), in particular, the program faToTwoBit for building binary sequence databases and the alignment program blat are used. Additionally, Perl, the Perl module LWP::UserAgent and Java version 1.6 have to be installed, which should already be the case on most *nix systems. However, in case of problems you should check whether you run a current Java version. The most recent version of the Java run time environment (JRE) can be downloaded from Sun's website.

installation of BLAT on Linux/Unix

The BLAT source code (current version: blatSrc34.zip) can be downloaded from Jim Kent's Website and unzipped. In a (bash) shell, type

> wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc34.zip
> unzip blatSrc34.zip
> cd blatSrc

To install and run blat, the environment variable MACHTYPE (machine type) has to be set. You can use the command

> echo $MACHTYPE

to see whether it is already set. If not, type

> uname -m

and set the MACHTYPE to the output.

On a bash shell, type:
> export MACHTYPE=WHAT_UNAME_RETURNED

On a C shell, type:
> setenv MACHTYPE WHAT_UNAME_RETURNED



Create the directory ~/bin/$MACHTYPE by typing

> mkdir ~/bin/$MACHTYPE

If it does not exist, create a MACHTYPE folder in the blatSrc/lib directory. Finally change to the blatSrc folder and type

> make

It might become necessary to modify your shell configuration files like .bashrc or .cshrc to include the blat executable in the $PATH variable and to set the MACHTYPE automatically whenever a shell is started. For example, add the following lines to the end of the .bashrc in your home directory if you use a bash shell.

MACHTYPE=x86_64
export MACHTYPE
export PATH="$PATH":~/bin/$MACHTYPE

The README file in the installation directory contains detailed installation instructions.

Aligning Next Generation data

download

This software handles sequences in FASTA, CSFASTA, and FASTQ format, i.e. Illumina Genome Analyzer, Roche 454, and ABI SOLiD reads. There exist more efficient alignment algorithms than BLAT, but most of them are platform-specific and cannot handle reads of arbitrary lengths. In order to speed up the alignment procedure, the reference sequence can be reduced to a set of target regions (e.g. exome sequencing) and additionally reads can be split into smaller chunks to parallelise the alignment step.

If your computer accesses the Internet via a proxy server, the environmental variable http_proxy has to be set. This can be done by typing:

> export http_proxy=http://proxy.my.place/       (on bash shells)
> setenv http_proxy http://proxy.my.place/       (on C shells)

The main program is seq2snippet.pl. This script reads sequences in any of the formats mentioned above. It also needs a FASTA file with the reference genome sequences, and a BED file which specifies a set of target regions, if the query shall be restricted to certain regions. The BED file can either specify one region, all annotated transcripts, or the whole genome. If no target regions are specified, then the whole reference sequence in the FASTA file is used as target. If (instead of the reference sequence file) a directory including all chromosomes is specified, the mapping will be carried out for the whole genome.
An annotation file with the current ENSEMBL transcript annotations also has to be provided. If this file does not exist, it will be automatically downloaded from our website
The second script, uc_getGenomicSequence.pl is automatically called by seq2snippet.pl. and extracts a set of query sequences from a genomic FASTA file.

If seq2snippet.pl is executed with no arguments, a help is printed:

> seq2snippet.pl

 seq2snippet.pl -f reads.fa -g genome.fa -t target.bed

-t FILENAME target regions in bed format
-g FILENAME genomic sequences in fasta format
-c NAME genomic sequence of chromosome NAME in fasta format
-a FILENAME annotation file in tab separated format
-n NAME name of the experiment
-p NUMBER number of processors to use for the mapping
-tile NUMBER the tile size for the blat search (18)
-mis the number of mismatches in the seed (1)

short read input:
-f FILENAME query sequences in fasta format
-csf FILENAME query sequences in csfasta format (SOLiD)
-fastq FILENAME query sequences in fastq format (GAII)

filter rules:
-len Minimal alignment percentage for a single read (0.95)
-frq Minimal frequency of a variation (0.11)
-cov Minimal coverage of aligned reads for a position (10)

Example:
perl seq2snippet -c 15 -t EXAMPLE_FBN1/targets.bed -fastq EXAMPLE_FBN1/fbn_reads.fastq -p 4 -n FBN

The name EXPNAME of the experiment (option -n EXPNAME) determines the name of the output files. If this option is not used, all output files will have the prefix experiment_. The tileSize for the blat search is set to 18, the number of tiles is set to 1 and a length and percentage identity threshold can be set by -len, otherwise default parameters are used.

parallelising the alignment step

The alignment step can be parallelised; the -p option specifies the number of processors that will be used to run blat. An example using four processors would be:
> perl seq2snippet.pl -c 15 -t EXAMPLE_FBN/targets.bed -fastq EXAMPLE_FBN/fbn_reads.fastq -p 4 -n FBN

genome-wide and target region mapping, alignment and SNP-calling parameters

There are three filtering options in the program:

-len: minimal percentage of a read that has to be found in an alignment, the same value will also be used as percentage identity parameter in the BLAT search

-cov: minimal coverage for a single position to be regarded as an alteration

-frq: frequency of the deviation from the reference sequence relative to the coverage

Setting the coverage threshold to 1 would report each encountered deviation and by changing the variation frequency one can explicitly look for homozygous variants. There is one additional option (-CDS) which refines the reference sequence to regions, which intersect coding sequences. For large amounts of sequencing data, this will speed up the mapping procedure (although there is an overhead due to the construction of the target sequence database).

Please note that a read is assigned to the position where it aligns best, given the length threshold. That means that reads mapping to other genomic regions which are not included in the analysis might erroneously be regarded as 'strong' alterations in the target region! To circumvent this bias, paralogous sequences should be included in the target set or the whole genome should be used as a reference.

example: simulation for FBN1

We generated sample data, in which we introduced 400 SNPs (228 in coding sequence) into a 400kb target region encompassing the FBN1 gene. For this region, we sampled 250,000 reads of length 36 bp and ran the seq2snippet mapping pipeline with varying parameters such as coding sequence filter and whole genome mapping in order to evaluate impacts on SNP detection and running times (on a machine with 4 processors).
Run Reference CDS Time Covered true predicted
Filter Sequence SNPs SNPs
1) Fbn1 yes 1:10 min 230 kb 204 (228) 577
2) Fbn1 no 1:40 min 400 kb 359 (400) 1282
3) Build 37 yes 8h30 min 750 kb 200 (228) 1680
4) Build 37 no 14h 1554 kb 350 (400) 1458

Output

The NAME_snippets.tsv contains all alterations of the run NAME that could be mapped on the Ensembl transcripts specified in the annotation file and located in the target region(s). This file is used to start a batch query in MutationTaster. It is tab-delimited and has the following columns:
transcript ID snippet name freq cov chr start end
ENST00000408784 AGTTAGGCCAGTGATGGTTT[A/G]AAAGTAATGGACAG experiment1000000 116 120 chr4 78842966 78842966

In batch queries, only the first three columns (transcript, snippet, name) are considered.

The NAME_SNPs.bed and NAME_read_coverage.wig files contain the 10bp window read coverage and the location of each alteration and can be displayed in the UCSC Genome Browser (Kuhn 2009). The raw counts for all variations are stored in NAME_frequencies.txt. The first three columns denote chromosome, start and end positions of the reference sequences, followed by the reference nucleotide (multiple nucleotides in case of a deletion), the coverage of this position, all counts for nucleotides A,C,G,T, other variants like insertions and a marker (=,!), indicating whether a variation at this position was found.

Interface to other alignment programs

We provide a Perl script (cns2snippet.pl) that converts data from cns format into MutationTaster snippets, so that other software can also be used to align the raw data. This program takes as input a consensus.cns file, transcript annotations and a reference sequence FASTA file and reports all variations above a certain coverage and variation frequency threshold that map to Ensembl transcripts. The output is the standard batch query format of MutationTaster.

If also SNP-calling was done by another program, the script (snp2snippet.pl) can be used to convert genomic positions of identified sequence alterations into batch query input for MutationTaster. snp2snippet.pl takes as input:

Contact

In case you discover bugs, have suggestions or questions, please write an e-mail to
Christian Rödelsperger (christian.roedelsperger AT charite.de) or to
Dominik Seelow
(dominik.seelow AT charite.de).

References