Cravatt-lab guide to DTASelect:

DTASelect is a program for filtering, grouping, and interpreting proteomics search results. You can find the original publication here and you can also download the manual for DTASelect v1.9 here, which includes lots of helpful background information and explains much of the functionality of DTASelect in great detail. However, the manual doesn't help much with installation and, being many years old, doesn't mention any of the features that have been added since 2004. This page is meant to fill in a few of those holes.

How to install DTASelect

To install DTASelect (either version), go to the Yates lab software download page here: link. Download either version 1.9 or 2.0; you can read about the differences between these versions below. Both versions are free, but you must agree to a License Agreement and supply your name and details.

That's it! DTASelect should now be installed. If you open a command prompt and type "dtaselect" <enter> it should give you an error message about not being able to read the sequest.params file.

How to run DTASelect

To run DTASelect you need two things in a folder: your .SQT file (or .OUT files) and a sequest.params file. You can download a sample sequest.params file here. Once these two things are in place, follow these instructions:

Difference between DTASelect v1.9 and DTASelect v2.0:

DTASelect 1.9 does a very good job and is simple to use, but it is relatively primitive with respect to modern statistical methods for false-positive rate estimation. Furthermore, careful use of DTASelect 2.0 should lead to slightly improved sensitivity by enabling fine-tuning of quality thresholds and parameters within acceptable accuracy limits. However, DTASelect 2.0 is also a bit more sophisticated, and its parameters should be used with care. For more information about what, precisely, the differences are, read on:

DTASelect version 2.0 is a new version that includes several advanced statistical features for empirical determination of peptide false-discovery rates. The basis for these features is the use of reverse-concatenated protein databases. Reversed or shuffled protein sequences serve as a decoy with which quality thresholds can be empirically derived to maximize sensitivity and accuracy. To determine the accuracy of a particular parameter-set, SEQUEST is run using a reverse-concatenated database and DTASelect is used to filter the SEQUEST results according to certain quality thresholds (like XCorr, deltaCN, etc). One can then count up the number of decoy peptides that were identified. Theory states that if 5 out of 100 peptides identified are from the decoy database then, on average, 5 of your 95 forward peptides are also likely mis-identified (see Elias & Gygi, Nat Methods 2007 for details). Thus, one can empirically estimate the false positive rate given a certain set of DTASelect parameters.
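The decoy arithmetic above can be written out as a minimal sketch. The function name is hypothetical, and the calculation only restates the reasoning in the text (each decoy hit implies, on average, one wrong forward hit); it is not necessarily the exact figure DTASelect reports in its output table.

```python
def estimate_false_positive_rate(n_forward, n_decoy):
    """Estimate the fraction of forward peptides that are mis-identified.

    Assumption: every decoy hit that passes the filter implies, on average,
    one equally wrong hit among the forward peptides.
    """
    if n_forward <= 0:
        raise ValueError("need at least one forward peptide")
    return n_decoy / n_forward


# The example from the text: 100 peptides pass, 5 of them are decoys,
# so 95 are forward peptides and about 5 of those are expected to be wrong.
rate = estimate_false_positive_rate(n_forward=95, n_decoy=5)
print(f"{rate:.3f}")  # 0.053
```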

The traditional way of running DTASelect was to use fixed XCorr thresholds (e.g., 1.8 for +1, 2.5 for +2, and 3.5 for +3 peptides) and a fixed deltaCN threshold (0.08), chosen to ensure only high-quality peptide matches. These parameters are quite stringent and rarely lead to false-positive identifications (typical false-positive rates with these parameters are below 1%). That being the case, it is often possible to relax those thresholds to identify more true peptides without adversely affecting your false-positive rate, but manual adjustment of thresholds is tedious and difficult to do accurately. DTASelect2 employs a linear discriminant function to empirically determine which parameter values should be chosen to achieve a desired false-positive rate. So, rather than explicitly specifying what XCorr values should be used (or relying on the default values), the user tells DTASelect2 to achieve a certain false-positive rate (the default is 5%). DTASelect2 then analyzes the distribution of decoy and forward peptides with respect to various quality scores (like XCorr) and empirically determines how to set these parameters to achieve the desired false-positive rate.
For more info see Keller 2002 Anal. Chem.
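To illustrate the idea of deriving a threshold from a target false-positive rate, here is a deliberately simplified sketch. It scans a single score (think XCorr) rather than fitting a linear discriminant across several scores, so it is a stand-in for the concept and not DTASelect2's actual procedure. The function name and the example scores are made up.

```python
def pick_score_threshold(psms, target_rate):
    """Return the lowest score cutoff whose decoy/forward ratio stays
    at or below target_rate, or None if no cutoff qualifies.

    psms is a list of (score, is_decoy) pairs.
    """
    ordered = sorted(psms, key=lambda p: p[0], reverse=True)
    best = None
    n_forward = 0
    n_decoy = 0
    for i, (score, is_decoy) in enumerate(ordered):
        if is_decoy:
            n_decoy += 1
        else:
            n_forward += 1
        # Only evaluate a cutoff once all matches tied at this score are counted.
        if i + 1 < len(ordered) and ordered[i + 1][0] == score:
            continue
        if n_forward > 0 and n_decoy / n_forward <= target_rate:
            best = score
    return best


# Hypothetical matches: (score, is_decoy)
example = [
    (4.0, False), (3.5, False), (3.0, False), (2.8, False),
    (2.5, True), (2.4, False), (2.0, True), (1.9, False),
]
print(pick_score_threshold(example, target_rate=0.25))  # 2.4
print(pick_score_threshold(example, target_rate=0.0))   # 2.8
```

Relaxing the target from 0% to 25% lowers the cutoff from 2.8 to 2.4 and admits one more forward match, which is the trade-off described above.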

So, if you run DTASelect v1.9 without any extra parameters, it will use the default XCorr values (1.8, 2.5, and 3.5 for +1, +2, and +3 peptides, respectively) and the default deltaCN value (0.08) to decide which peptides are correct and which are likely incorrect. But if you run DTASelect v2.0 without any additional parameters, it will attempt to identify reverse sequences in your database (by default it looks for entries in your FASTA file that begin with "Reverse_") and use a linear discriminant function to achieve a false-positive rate of 5% (by default). If you run DTASelect2 and your database does not contain reverse-concatenated entries (or DTASelect2 cannot find them because they don't begin with "Reverse_"), then DTASelect2 will appear to complete successfully, but there will be no peptides matched. To run DTASelect2 without statistics (the way DTASelect 1.9 works), use the "--nostats" option.
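Because a database without recognizable decoy entries fails silently, it can be worth checking the FASTA file before running DTASelect2. The sketch below is a hypothetical helper, not part of DTASelect. It assumes the "Reverse_" prefix appears immediately after the ">" of each header line; the example identifiers are invented.

```python
def count_decoy_entries(fasta_lines, decoy_prefix="Reverse_"):
    """Count all FASTA entries and those whose header starts with decoy_prefix.

    fasta_lines can be any iterable of lines, including an open file handle.
    """
    total = 0
    decoys = 0
    for line in fasta_lines:
        if line.startswith(">"):
            total += 1
            if line[1:].startswith(decoy_prefix):
                decoys += 1
    return total, decoys


# Small in-memory example with made-up identifiers and sequences.
example_fasta = [
    ">EXAMPLE0001 hypothetical protein",
    "MKTAYIAKQR",
    ">Reverse_EXAMPLE0001 hypothetical protein",
    "RQKAIYATKM",
]
total, decoys = count_decoy_entries(example_fasta)
print(total, decoys)  # 2 1
```

For a real database, pass an open file handle instead of the list. A decoy count of zero suggests you need either a reverse-concatenated database or the "--nostats" option.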

The default peptide false-positive rate in DTASelect2 is 5% (0.05). To adjust this, use the "--fp" switch. For example, for a false-positive rate of 1% use "--fp 0.01". When evaluating false-positive rates, it is imperative to check the actual false-positive rate in the table at the bottom of the DTASelect.html file. DTASelect2 will try to achieve the desired false-positive rate but sometimes does a poor job, so one must keep an eye on the actual false-positive rate as well.

When running DTASelect2, you'll notice that it performs three separate analyses for the charge states +1, +2, and +3. This is because peptides with different charge states have different inherent qualities and should not be evaluated together. Similarly, peptides with different tryptic status should not be evaluated together because fully-tryptic peptides fragment somewhat differently (better) than semi- or non-tryptic peptides. As such, it is recommended that you perform your DTASelect2 analyses using the "--trypstat" option, which performs a separate linear discriminant analysis for each tryptic status and thus performs 9 separate analyses instead of 3 (3 charge states x 3 potential tryptic statuses = 9). This usually has the added advantage of increasing your fully-tryptic peptide results: because the majority of false-positive matches are usually semi-tryptic, the stringency for semi-tryptic peptides is increased, which allows the stringency for fully-tryptic peptides to be relaxed. Similarly, if you are looking for post-translationally modified peptides, it is advisable to also use the "--modstat" option, which treats modified and unmodified peptides separately.
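The 3 versus 9 analysis classes can be illustrated with a short sketch. It only shows how peptide matches would be partitioned, not how DTASelect2 scores them. The function names and example data are hypothetical, and the tryptic status numbering (2 = fully, 1 = semi, 0 = non-tryptic) is borrowed from the "-y" option as an assumption.

```python
from collections import Counter
from itertools import product

CHARGE_STATES = (1, 2, 3)
TRYPTIC_STATUSES = (2, 1, 0)  # assumed: 2 = fully, 1 = semi, 0 = non-tryptic


def analysis_classes(use_trypstat):
    """List the groups that would each get their own statistical analysis."""
    if use_trypstat:
        return list(product(CHARGE_STATES, TRYPTIC_STATUSES))
    return [(charge, None) for charge in CHARGE_STATES]


def count_per_class(psms, use_trypstat):
    """Count peptide matches per group; psms holds (charge, tryptic_status)."""
    counts = Counter()
    for charge, status in psms:
        counts[(charge, status if use_trypstat else None)] += 1
    return counts


# Made-up peptide matches: (charge, tryptic_status)
example_psms = [(2, 2), (2, 2), (2, 1), (3, 0), (1, 2)]
print(len(analysis_classes(False)))                  # 3
print(len(analysis_classes(True)))                   # 9
print(count_per_class(example_psms, True)[(2, 2)])   # 2
print(count_per_class(example_psms, False)[(2, None)])  # 3
```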

Common parameters for DTASelect 1.9:

dtaselect -l keratin -o
-l keratin    removes all entries whose identifier matches 'keratin'
-o            makes peptide matches parsimonious to avoid redundant matches

Common parameters for DTASelect 2.0:

dtaselect --trypstat --fp 0.02
--trypstat    include cleavage status when evaluating statistics
--fp 0.02     try to achieve a 2% peptide false-discovery rate

Additional useful options:

--help         view all DTASelect options
--modstat      include modification status when evaluating statistics (only DTASelect 2.0)
--mass         use delta mass (ppm) for statistics [recommended for Orbi data] (only DTASelect 2.0)
--nostats      do not perform any statistics; operate in v1.9 mode (only DTASelect 2.0)
--hidedecoy    do not show decoy hits in output files (only DTASelect 2.0)
-p 1           require only 1 peptide per locus (default is 2). This can DRAMATICALLY increase your actual false-positive rate. Use with care.
-y 2           require peptides to be fully-tryptic (-y 1 requires half-tryptic, -y 0 is unrestricted)
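If you run DTASelect from scripts, a small helper that assembles the command line can keep option combinations consistent. The following is a sketch only: the function and its keyword names are hypothetical, and it emits nothing beyond the options listed above. It does not run DTASelect; it just builds the argument list.

```python
def build_dtaselect_args(fp=None, trypstat=False, modstat=False, mass=False,
                         nostats=False, hidedecoy=False,
                         min_peptides=None, tryptic_ends=None):
    """Assemble a dtaselect command line from the options described above."""
    args = ["dtaselect"]
    if trypstat:
        args.append("--trypstat")
    if modstat:
        args.append("--modstat")
    if mass:
        args.append("--mass")
    if nostats:
        args.append("--nostats")
    if hidedecoy:
        args.append("--hidedecoy")
    if fp is not None:
        args += ["--fp", str(fp)]
    if min_peptides is not None:
        args += ["-p", str(min_peptides)]
    if tryptic_ends is not None:
        args += ["-y", str(tryptic_ends)]
    return args


# Reproduces the "common parameters" example for DTASelect 2.0.
print(" ".join(build_dtaselect_args(fp=0.02, trypstat=True)))
# dtaselect --trypstat --fp 0.02
```

The resulting list can be handed to whatever process launcher you use, with the working directory set to the folder that holds your .SQT file and sequest.params.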


Useful scripts and databases

All of these scripts are written in Perl so the first step is to install ActivePerl on your Windows PC. Download and run this file to install ActivePerl v5.8.8, or get the latest (free) build directly from ActiveState. (To install the activeperl-5-8-8.msi file, save it to your Desktop and right-click it and select 'Install').

Download these helpful scripts. Place the desired script into the appropriate folder, as described below. The archive consists of the following:


Here are two relatively non-redundant, reverse-concatenated variants of the IPI database (right-click and select 'save as...'):