Scripps Research

Next Generation Sequencing Core

Data Analysis

Tools for Processing and Analysis of Next Generation Sequencing Data:

The Genome Analyzer Pipeline Software (Pipeline) performs the early data analysis of a sequencing run: image analysis, base calling, and alignment. Alignment is performed with Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). The Pipeline also generates quality scores, which are used to filter and truncate reads to improve sequence accuracy. Error probabilities are reported routinely and are derived from the signal-to-noise ratio of each base.
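To illustrate how quality scores can drive filtering and truncation, here is a minimal sketch. It assumes Phred-scaled scores, where Q = -10·log10(p) for error probability p. The function names and the thresholds (`min_q`, `max_expected_errors`) are hypothetical choices for illustration and do not describe the Pipeline's internal implementation.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(p)


def truncate_read(seq, quals, min_q=20):
    """Cut the read just before the first base whose quality is below min_q.

    min_q=20 (1% error) is an assumed threshold, not a Pipeline default.
    """
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i], quals[:i]
    return seq, quals


def passes_filter(quals, max_expected_errors=1.0):
    """Keep a read only if its expected number of miscalled bases is small."""
    expected_errors = sum(phred_to_error_prob(q) for q in quals)
    return expected_errors <= max_expected_errors
```

For example, `truncate_read("ACGTAC", [30, 30, 25, 10, 30, 30])` keeps only the first three bases, because the fourth base falls below the assumed threshold.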

The CASAVA Software package (short for "Consensus Assessment of Sequence And VAriation") performs post-sequencing analysis of reads that the Pipeline has aligned to the user-selected reference genome. The core of the application is the “allele caller”. During the build process, CASAVA collates, filters, and compiles aligned reads. It then calls the genomic consensus sequence using a Bayesian algorithm and compares it to the selected reference sequence to identify homozygous or heterozygous SNPs. CASAVA generates a range of statistics, such as mean depth and percentage chromosome coverage, to enable comparison with previous builds or other individuals. The RNA Sequencing analysis tool set built into CASAVA provides read counts for exons, genes, and the splice junctions implicated in alternative splicing of transcripts.
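The general idea of Bayesian consensus calling can be sketched as follows. This is a simplified illustration and not CASAVA's actual algorithm: it assumes a diploid site, a uniform prior over the ten possible genotypes, and a single fixed per-base error rate. All function names and parameters are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(pileup, error_rate=0.01):
    """Posterior probability of each diploid genotype at one site.

    pileup is a string of the bases observed in the aligned reads.
    A uniform prior over genotypes is assumed, so the posterior is
    simply the normalized likelihood.
    """
    log_liks = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        ll = 0.0
        for base in pileup:
            # Each read is drawn from one of the two alleles with equal chance.
            p1 = 1 - error_rate if base == a1 else error_rate / 3
            p2 = 1 - error_rate if base == a2 else error_rate / 3
            ll += math.log(0.5 * p1 + 0.5 * p2)
        log_liks[a1 + a2] = ll
    # Subtract the maximum before exponentiating for numerical stability.
    max_ll = max(log_liks.values())
    weights = {g: math.exp(ll - max_ll) for g, ll in log_liks.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_site(pileup, reference_base, error_rate=0.01):
    """Return (genotype, classification, posterior) for the best genotype."""
    posteriors = genotype_posteriors(pileup, error_rate)
    genotype = max(posteriors, key=posteriors.get)
    if genotype == reference_base * 2:
        kind = "reference"
    elif genotype[0] == genotype[1]:
        kind = "homozygous SNP"
    else:
        kind = "heterozygous SNP"
    return genotype, kind, posteriors[genotype]
```

Calling `call_site("AAAAAGGGGG", "A")` classifies the site as a heterozygous SNP, while `call_site("GGGGGGGG", "A")` classifies it as a homozygous SNP. A production caller would also use per-base quality scores and a prior informed by the reference.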

Flicker: We use Flicker, a pre-release software tool designed as an add-on to the Illumina Genome Analyzer Pipeline software for preliminary processing and initial analysis of small RNA sequence runs (miRNA-Seq). Flicker runs on any Unix or Linux machine. It includes publicly sourced alignment target files derived from the current version of miRBase for human, mouse, rat, nonhuman primate, and Arabidopsis microRNAs.

Flicker performs four main functions to facilitate analysis of miRNA-Seq data:

  1. Trimming: Flicker trims the known Illumina adaptor sequences from the small RNA reads produced by the sequencing run of an experimental sample. Trimming is necessary to create a data file that can be compared against known databases of small RNA sequences. The process is not trivial: because the small RNA fragments in the cellular RNA pool vary in length, the adaptor sequence starts at a different position in each read, so trimming at a single fixed position does not work.
  2. Alignment: Once small RNA sequences have been generated by trimming, they are aligned to the selected reference genome using Iterated ELAND. The ELAND short read aligner handles tags of 15 to 35 bases.
  3. Sequential Alignment: Flicker aligns the small RNA sequences to public database files such as miRBase sequentially, based on sequence length. Mature miRNAs are only approximately 21-25 bases long, but they are processed from longer transcribed products that are also present in the cell’s RNA transcript pool, specifically the immature pre-miRNAs (60-90 bases). These precursors are also detected and appear as reads longer than 25-30 bases after the adaptor sequences are trimmed.
  4. Summary Reporting: Flicker creates a summary data report including counts of reads and unique sequences, relative abundance of specific sequences, major categories of alignment targets, and statistics on trimmed read lengths.
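The trimming, length-based grouping, and summary steps described above can be sketched roughly as follows. This is not Flicker's code. The adaptor string is an illustrative placeholder (substitute the sequence from your library preparation kit), matching is exact with no mismatches tolerated, and the 25-base cut-off and all function names are assumptions made for the example.

```python
from collections import Counter

# Illustrative placeholder, not necessarily the adaptor that Flicker trims.
ADAPTOR = "TCGTATGCCGTCTTCTGCTTG"


def trim_adaptor(read, adaptor=ADAPTOR, min_overlap=6):
    """Remove the 3' adaptor wherever it starts in the read.

    Handles a full adaptor inside the read, or an adaptor prefix of at
    least min_overlap bases running off the 3' end of the read.
    """
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    longest = min(len(adaptor) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:len(read) - overlap]
    return read  # no adaptor detected


def length_class(insert, mature_max=25):
    """Group trimmed reads by length (assumed cut-off of 25 bases)."""
    return "mature-sized" if len(insert) <= mature_max else "precursor-sized"


def summarize(reads, adaptor=ADAPTOR):
    """Build a small summary report from raw reads."""
    inserts = [trim_adaptor(r, adaptor) for r in reads]
    inserts = [s for s in inserts if s]  # drop adaptor-only reads
    counts = Counter(inserts)
    total = len(inserts)
    return {
        "total_reads": total,
        "unique_sequences": len(counts),
        "length_histogram": dict(Counter(len(s) for s in inserts)),
        "length_classes": dict(Counter(length_class(s) for s in inserts)),
        "relative_abundance": {s: n / total for s, n in counts.items()},
    }
```

Because the adaptor can begin at any position, the sketch searches for it rather than cutting at a fixed offset. A real trimmer would also tolerate sequencing errors within the adaptor.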

GenomeStudio RNA Sequencing Module v1.0: GenomeStudio is a modular software platform for viewing and analyzing data obtained in multiple sequencing applications: ChIP-Seq, DNA-Seq, RNA-Seq, and miRNA-Seq. Files are initially generated using the Pipeline and CASAVA, after which the CASAVA files are imported into GenomeStudio. This integrated platform is proprietary to Illumina, and we have purchased several copies for analysis in the Next Generation Sequencing Core. We routinely use the RNA Sequencing Module to facilitate data analysis from RNA-Seq and miRNA-Seq runs. This software package is necessary for viewing these extremely large and complex data sets in the reduced terms of our experimental objectives, such as comparing expression levels and splice variants, or detecting SNPs in specific candidate genes identified in other studies as potentially high-value targets. Another important function of GenomeStudio is to create standard file formats for efficient data export and sharing.

CLC bio: This is a new and rapidly evolving field. We anticipate that tools for sequence analysis and viewing will continue to evolve, and we regularly survey the literature and attend expert user meetings to stay abreast of these changes. For example, CLC Genomics Workbench (http://www.clcbio.com/index.php?id=1240) is a desktop application we routinely use for analyzing and visualizing Next Generation Sequencing data. The workbench imports millions of short reads within minutes. It is useful for de novo and reference alignments, SNP detection, deletion/insertion polymorphism (DIP) detection, and identification of genomic rearrangements. The software is also valuable for visualization and interactive graphical manipulation of results.

Open source tools: MAQ, Bowtie, Velvet: Another aligner we use is MAQ (Mapping and Assembly with Quality), which builds assemblies by mapping short reads to reference sequences (http://sourceforge.net/projects/maq/). MAQ can also use paired-end reads to map insertions, deletions, and translocations, which helps overcome the challenges of sequence alignment in regions of repetitive elements. Consensus genotypes are called, including homozygous and heterozygous polymorphisms, with a Phred-scaled probabilistic quality assigned to each base. Bowtie is among the fastest short read aligners, but it does not report statistics on the uniqueness or multiple alignment of reads. Bowtie does, however, take quality values into account during alignment and is a useful open source tool. Finally, Velvet performs de novo assembly and can combine short reads with read pairs to produce useful assemblies. Velvet’s algorithms manipulate de Bruijn graphs to enable genomic sequence assembly. A de Bruijn graph is a compact representation based on short ‘words’ (k-mers) that is well suited to high-coverage, short read (25-50 bp) data sets. When applied to Illumina data sets without read pairs, Velvet generates contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC. Velvet is critical for our de novo assembly work because its parameters can be customized to the data at hand and, in our experience, it works better for this purpose than the CLC Workbench and GenomeStudio.
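To make the de Bruijn graph idea concrete, the following minimal sketch builds the textbook form of the graph, in which nodes are (k-1)-mers and each k-mer contributes one edge, then walks an unambiguous path to produce a contig. It is an illustration only and does not reflect Velvet's internal data structures or its error-correction steps. The function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: graph[prefix][suffix] = k-mer count.

    Every k-mer in every read adds an edge from its first k-1 bases
    to its last k-1 bases; the count reflects k-mer coverage.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def extend_contig(graph, start):
    """Follow edges from start while the path is unambiguous.

    Stops at a branch (more than one successor), a merge (successor
    with more than one predecessor), a dead end, or a cycle.
    """
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node, seen = start, start, {start}
    while node in graph and len(graph[node]) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]  # each step adds exactly one new base
        seen.add(nxt)
        node = nxt
    return contig
```

With the two overlapping reads `"ACGTTGCA"` and `"GTTGCATC"` and `k=4`, `extend_contig(build_de_bruijn(reads, 4), "ACG")` reconstructs the merged sequence. The choice of k is one of the parameters that must be tuned to read length and coverage.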