The Genome Analyzer Pipeline Software (Pipeline) performs the early data analysis of a sequencing run, including image analysis, base calling, and alignment. Alignment is performed with Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). The Pipeline also generates quality scores, which are used to filter and truncate reads to improve sequence accuracy. Error probabilities are routinely reported for each base, derived from its signal-to-noise ratio.
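To illustrate how per-base quality scores relate to error probabilities and how they can drive read truncation, the following is a minimal sketch. It assumes the standard Phred scale, Q = -10 log10(P); the function names and the threshold of 20 are hypothetical choices for illustration and are not taken from the Pipeline itself, which may use a different quality scale or filtering rule.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


def truncate_read(bases, quals, min_q=20):
    """Cut the read at the first base whose quality is below min_q.

    min_q=20 (1% error) is an illustrative threshold, not a Pipeline default.
    """
    for i, q in enumerate(quals):
        if q < min_q:
            return bases[:i], quals[:i]
    return bases, quals


if __name__ == "__main__":
    trimmed, kept = truncate_read("ACGTAC", [30, 30, 25, 10, 30, 30])
    print(trimmed, kept)
    print(phred_to_error_prob(30))
```

Truncating at the first low-quality base reflects the tendency of base quality to decline toward the 3' end of a short read; other trimming strategies are equally possible.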
The CASAVA Software package (short for "Consensus Assessment of Sequence And VAriation") performs a post-sequencing analysis of data from reads aligned to the user-selected reference genome by the Pipeline. The core of the application is the “allele caller”. During the build process, the CASAVA Software collates, filters, and compiles aligned reads. CASAVA then calls the genomic consensus sequence using a Bayesian algorithm and compares it to the selected reference sequence in order to identify homozygous or heterozygous SNPs. CASAVA generates a range of statistics, such as mean depth and percentage chromosome coverage, to enable comparison with previous builds or other individuals. The RNA Sequencing analysis tool set built into CASAVA provides read counts for exons, genes, and the splice junctions implicated in alternative splicing of transcripts.
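The idea behind a Bayesian allele call can be sketched as follows. This is a generic textbook-style diploid genotype model and not CASAVA's actual algorithm: the function name, the prior values, and the assumption that sequencing errors are spread evenly over the three other bases are all illustrative assumptions.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, het_prior=0.001):
    """Posterior probability of ref/ref, ref/alt and alt/alt at a single site.

    bases: observed base calls covering the site
    quals: Phred-scaled qualities (assumed > 0), one per base
    ref, alt: the two candidate alleles (assumed to differ)
    het_prior: illustrative prior for a heterozygous site (an assumption)
    """
    priors = {
        (ref, ref): 1.0 - 1.5 * het_prior,
        (ref, alt): het_prior,
        (alt, alt): het_prior / 2.0,
    }
    log_post = {}
    for genotype, prior in priors.items():
        log_like = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)
            # Each read samples one of the two alleles with probability 0.5.
            # The call is correct with probability 1 - err; otherwise it is
            # assumed to be one of the three other bases with equal chance.
            p = sum(
                0.5 * ((1.0 - err) if base == allele else err / 3.0)
                for allele in genotype
            )
            log_like += math.log(p)
        log_post[genotype] = math.log(prior) + log_like
    # Normalise in log space to avoid underflow at high depth.
    peak = max(log_post.values())
    total = sum(math.exp(v - peak) for v in log_post.values())
    return {g: math.exp(v - peak) / total for g, v in log_post.items()}


if __name__ == "__main__":
    post = genotype_posteriors("AAAAAGGGGG", [30] * 10, ref="A", alt="G")
    print(max(post, key=post.get))
```

Comparing the most probable genotype with the reference base at each position is what distinguishes a homozygous SNP (alt/alt) from a heterozygous one (ref/alt).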
Flicker: We use Flicker, a pre-release software tool designed as an add-on to the Illumina Genome Analyzer Pipeline software for preliminary processing and initial analysis of small RNA sequencing runs (miRNA-Seq). Flicker runs on any Unix or Linux machine. It includes alignment target files from public sources, derived from the current version of miRBase for human, mouse, rat, nonhuman primate, and Arabidopsis microRNAs.
Flicker does four main things to facilitate analysis of miRNA-Seq:
GenomeStudio RNA Sequencing Module v1.0: GenomeStudio is a modular software platform for viewing and analyzing data obtained in multiple sequencing applications: ChIP-Seq, DNA-Seq, RNA-Seq, and miRNA-Seq. Files are initially generated using the Pipeline and CASAVA, after which the CASAVA files are loaded into GenomeStudio. This integrated platform is proprietary to Illumina, and we have purchased several copies for analysis in the Next Generation Sequencing Core. We routinely use the RNA Sequencing Module to facilitate data analysis from RNA-Seq and miRNA-Seq runs. This software package is necessary for viewing these extremely large and complex data sets in terms of our experimental objectives, such as comparing expression levels and splice variants, or detecting SNPs in specific candidate genes identified in other studies as potentially high-value targets of biological significance. Another important function of GenomeStudio is to create standard file formats for efficient data export and sharing.
CLCbio: This is a new and rapidly evolving field. We therefore anticipate that tools for sequence analysis and viewing will continue to evolve, and we regularly survey the literature and attend expert user meetings to stay abreast of these changes. For example, CLC Genomics Workbench (http://www.clcbio.com/index.php?id=1240) is a desktop application we routinely use for analyzing and visualizing Next Generation Sequencing data. The CLC workbench imports millions of short reads within minutes. The workbench is useful for de novo and reference alignments, SNP detection, deletion/insertion polymorphism (DIP) detection, and identification of genomic rearrangements. The software is also valuable for visualization and interactive graphical manipulation of results.
Open source tools: MAQ, Bowtie, Velvet: Another aligner we use is MAQ (Mapping and Assembly with Quality), which builds assemblies by mapping short reads to reference sequences (http://sourceforge.net/projects/maq/). MAQ can also use the results of runs performed with the paired-end read protocol to map insertions, deletions, and translocations, which helps overcome the challenges of sequence alignment in regions of repetitive elements. Consensus genotypes are called, including homozygous and heterozygous polymorphisms, with a Phred-scaled probabilistic quality assigned to each base. Another tool, Bowtie, is among the fastest short read aligners, but it does not report statistics on the uniqueness or multiplicity of read alignments. Bowtie does, however, take base quality values into account during alignment, and it is a useful open source tool. Finally, Velvet performs de novo assembly, using short reads in combination with read pairs to produce useful assemblies. Velvet’s algorithms manipulate de Bruijn graphs to assemble genomic sequence. A de Bruijn graph is a compact representation based on short ‘words’ (k-mers) that is well suited to high-coverage data sets of very short reads (25-50 bp). When applied to Illumina data sets without read pairs, Velvet generates contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC. Velvet is critical for our de novo assembly work because its parameters can be tuned to the data at hand, and in our experience it performs better for this purpose than the CLC Workbench and GenomeStudio.
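To make the de Bruijn graph idea concrete, the following is a minimal sketch of how k-mers from short reads can be linked into a graph and walked to recover a contig. It is a simplified illustration and not Velvet's implementation: the function names are hypothetical, nodes here are (k-1)-mers joined by k-mer edges, and reverse complements, sequencing errors, and coverage are all ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Link each (k-1)-mer to the (k-1)-mers that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow the graph from `start` while the next node is unambiguous."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever around a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTGCA"]  # two toy reads overlapping on GCGT
    graph = build_de_bruijn(reads, k=4)
    print(extend_contig(graph, "ATG"))
```

The walk stops wherever a node has more than one successor, which is where repeats create ambiguity; this is one reason that the choice of k and the availability of read pairs influence contig length.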