Next Generation Sequencing pipeline
installation of BLAT on Linux/UnixThe BLAT source code (current version: blatSrc34.zip) can be downloaded from Jim Kent's Website and unzipped. In a (bash) shell, type
This software handles sequences in FASTA, CSFASTA, and FASTQ format, i.e. Illumina Genome Analyzer, Roche 454, and ABI SOLiD reads. There exist more efficient alignment algorithms than BLAT, but most of them are platform-specific and cannot handle reads of arbitrary lengths. In order to speed up the alignment procedure, the reference sequence can be reduced to a set of target regions (e.g. exome sequencing) and additionally reads can be split into smaller chunks to parallelise the alignment step.
If your computer accesses the Internet via a proxy server, the environmental variable http_proxy has to be set. This can be done by typing:
> export http_proxy=http://proxy.my.place/ (on bash shells)
> setenv http_proxy http://proxy.my.place/ (on C shells)
The main program is seq2snippet.pl. This script reads sequences in any of the formats mentioned above. It also needs a FASTA file with the reference genome sequences, and a BED file which specifies a set of target regions, if the query shall be restricted to certain regions. The BED file can either specify one region, all annotated transcripts, or the whole genome. If no target regions are specified, then the whole reference sequence in the FASTA file is used as target. If (instead of the reference sequence file) a directory including all chromosomes is specified, the mapping will be carried out for the whole genome.
An annotation file with the current ENSEMBL transcript annotations also has to be provided. If this file does not exist, it will be automatically downloaded from our website
The second script, uc_getGenomicSequence.pl is automatically called by seq2snippet.pl. and extracts a set of query sequences from a genomic FASTA file.
If seq2snippet.pl is executed with no arguments, a help is printed:
seq2snippet.pl -f reads.fa -g genome.fa -t target.bed
-t FILENAME target regions in bed format
-g FILENAME genomic sequences in fasta format
-c NAME genomic sequence of chromosome NAME in fasta format
-a FILENAME annotation file in tab separated format
-n NAME name of the experiment
-p NUMBER number of processors to use for the mapping
-tile NUMBER the tile size for the blat search (18)
-mis the number of mismatches in the seed (1)
short read input:
-f FILENAME query sequences in fasta format
-csf FILENAME query sequences in csfasta format (SOLiD)
-fastq FILENAME query sequences in fastq format (GAII)
-len Minimal alignment percentage for a single read (0.95)
-frq Minimal frequency of a variation (0.11)
-cov Minimal coverage of aligned reads for a position (10)
perl seq2snippet -c 15 -t EXAMPLE_FBN1/targets.bed -fastq EXAMPLE_FBN1/fbn_reads.fastq -p 4 -n FBN
The name EXPNAME of the experiment (option -n EXPNAME) determines the name of the output files. If this option is not used, all output files will have the prefix experiment_. The tileSize for the blat search is set to 18, the number of tiles is set to 1 and a length and percentage identity threshold can be set by -len, otherwise default parameters are used.
parallelising the alignment stepThe alignment step can be parallelised; the -p option specifies the number of processors that will be used to run blat. An example using four processors would be:
genome-wide and target region mapping, alignment and SNP-calling parameters
There are three filtering options in the program:
-len: minimal percentage of a read that has to be found in an alignment, the same value will also be used as percentage identity parameter in the BLAT search
-cov: minimal coverage for a single position to be regarded as an alteration
-frq: frequency of the deviation from the reference sequence relative to the coverage
Setting the coverage threshold to 1 would report each encountered deviation and by changing the variation frequency one can explicitly look for homozygous variants. There is one additional option (-CDS) which refines the reference sequence to regions, which intersect coding sequences. For large amounts of sequencing data, this will speed up the mapping procedure (although there is an overhead due to the construction of the target sequence database).
Please note that a read is assigned to the position where it aligns best, given the length threshold. That means that reads mapping to other genomic regions which are not included in the analysis might erroneously be regarded as 'strong' alterations in the target region! To circumvent this bias, paralogous sequences should be included in the target set or the whole genome should be used as a reference.
example: simulation for FBN1We generated sample data, in which we introduced 400 SNPs (228 in coding sequence) into a 400kb target region encompassing the FBN1 gene. For this region, we sampled 250,000 reads of length 36 bp and ran the seq2snippet mapping pipeline with varying parameters such as coding sequence filter and whole genome mapping in order to evaluate impacts on SNP detection and running times (on a machine with 4 processors).
|1)||Fbn1||yes||1:10 min||230 kb||204 (228)||577|
|2)||Fbn1||no||1:40 min||400 kb||359 (400)||1282|
|3)||Build 37||yes||8h30 min||750 kb||200 (228)||1680|
|4)||Build 37||no||14h||1554 kb||350 (400)||1458|
In batch queries, only the first three columns (transcript, snippet, name) are considered.
The NAME_SNPs.bed and NAME_read_coverage.wig files contain the 10bp window read coverage and the location of each alteration and can be displayed in the UCSC Genome Browser (Kuhn 2009). The raw counts for all variations are stored in NAME_frequencies.txt. The first three columns denote chromosome, start and end positions of the reference sequences, followed by the reference nucleotide (multiple nucleotides in case of a deletion), the coverage of this position, all counts for nucleotides A,C,G,T, other variants like insertions and a marker (=,!), indicating whether a variation at this position was found.
If also SNP-calling was done by another program, the script (snp2snippet.pl) can be used to convert genomic positions of identified sequence alterations into batch query input for MutationTaster. snp2snippet.pl takes as input: