How the Protein Calculator calculates charges and pI's




Introduction:



There is apparently a tremendous amount of confusion about how to estimate charges of individual proteins given only a single amino acid sequence. There also appears to be some confusion about how seriously to take these types of calculations.

So first let me say that the only way to get a measurement of a pI of a protein is to actually do the experiment (such as by an isoelectric focusing gel or measurements of pKa's of individual residues by NMR techniques). Barring this, there are fairly sophisticated calculations that can be done knowing the three-dimensional structure of the protein. Search the literature for studies done by Don Bashford for such examples. However, all of these techniques require more than simply the sequence of the protein.

Given this, a calculation that uses only the amino acid sequence (given that the protein folding problem isn't solved) is going to be wrong. The reason is that the calculations described below make the incorrect assumptions that all of the titratable sites on the protein are isolated from each other and that each has a pKa that is not influenced by the local protein environment. Further, pKa's are also influenced by other factors such as the total ionic strength of the solution. Hydrogen ions are not the only positively charged things that bind to carboxylates. So we know that the assumptions going into this calculation are wrong. Thus, any calculation that depends on them must also be wrong.

However, this calculation isn't so wrong that the answers aren't useful. It mostly depends on the type of question that you are asking. If you want to pick a pH that avoids the pI of a protein, or choose an ion-exchange column that will or won't bind your protein, then the calculation is useful. But if you want to make sophisticated arguments regarding the charge states of a protein, you had better do the measurements.

I seem to have made some users angry by:

  • not reporting pI's to more significant digits
  • telling them that they can't trust these results for what they were trying to use them for

I'm sorry if they are upset, but I don't want to see people trusting these numbers too greatly. Perhaps it's no longer important to emphasize this, but just because it came from a computer doesn't make it right.




The Calculation Part One





The first part of the problem is to estimate the charge of a residue at a particular pH. This can be calculated very easily given the fact that the pKa of a residue is the pH at which the residue is 50% protonated and 50% deprotonated. Simple algebra using equilibrium arguments leads to an expression that estimates the fraction of residues of that type that are positively charged (for things like lysine and arginine) or negatively charged (for things like aspartates and glutamates) at a particular pH. Also, in the calculation, don't forget the N- and C-termini of the protein. This algebraic rearrangement has a name (which I didn't know when I derived it), the Henderson-Hasselbalch equation [C Tanford and JG Kirkwood, 1957, JACS 79: 5333-5347].
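For reference, the algebra can be written out as follows. This is the standard textbook form of the rearrangement, not a transcription of the program's source. The symbols f_prot, f_+ and f_- are labels introduced here for illustration, and HA stands for the protonated form of any titratable group, whether it is neutral (a carboxylic acid) or charged (a protonated amine).

```latex
% Acid dissociation equilibrium HA <=> H+ + A-, rearranged into the
% Henderson-Hasselbalch form.
\[
  K_a = \frac{[\mathrm{H^+}]\,[\mathrm{A^-}]}{[\mathrm{HA}]}
  \quad\Longrightarrow\quad
  \mathrm{pH} = \mathrm{p}K_a + \log_{10}\frac{[\mathrm{A^-}]}{[\mathrm{HA}]}
\]

% Fraction of sites that are protonated at a given pH.
\[
  f_{\mathrm{prot}}
  = \frac{[\mathrm{HA}]}{[\mathrm{HA}] + [\mathrm{A^-}]}
  = \frac{1}{1 + 10^{\,\mathrm{pH} - \mathrm{p}K_a}}
\]

% Groups that are positive when protonated (Lys, Arg, His, N-terminus)
% carry a fractional charge of +f_+; groups that are negative when
% deprotonated (Asp, Glu, Tyr, Cys, C-terminus) carry -f_-.
\[
  f_{+} = f_{\mathrm{prot}},
  \qquad
  f_{-} = 1 - f_{\mathrm{prot}}
  = \frac{1}{1 + 10^{\,\mathrm{p}K_a - \mathrm{pH}}}
\]
```

Setting pH equal to pKa gives f_prot = 1/2, which matches the definition of the pKa as the pH at which the residue is 50% protonated.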

So given that you know from the sequence how many, say, histidines there are (n), and the fractional charge of residues with a histidine pKa (f) using the above calculation, the charge due to the histidines is simply nf. Now repeat for all other types of residues, keeping track of whether the protonated or the deprotonated form of the residue contributes to the charge.

The pKa's being used (as mentioned in the output) are from Stryer's Biochemistry 3rd edition: (N-terminus 8.0, C-terminus 3.1, Lysine 10.0, Arginine 12.0, Histidine 6.5, Glutamic acid 4.4, Aspartic acid 4.4, Tyrosine 10.0, and Cysteine 8.5). One might argue that there are better sources for the pKa's of isolated residues, but see the introduction for reasons why getting much better measurements for these residues won't improve the accuracy of the calculation significantly.

So part one is done. We can calculate an estimated charge for the protein at any particular pH.
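Part one can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Protein Calculator's actual source: the function and constant names are hypothetical, the sequence is assumed to be given in one-letter codes, and every cysteine is assumed to be a free thiol (none tied up in disulfides). The pKa values are the ones quoted above.

```python
from collections import Counter

# pKa values quoted in the text (Stryer's Biochemistry, 3rd edition).
# Keys are one-letter residue codes (an assumption about the input format).
PKA_POSITIVE = {"K": 10.0, "R": 12.0, "H": 6.5}            # +1 when protonated
PKA_NEGATIVE = {"D": 4.4, "E": 4.4, "Y": 10.0, "C": 8.5}   # -1 when deprotonated
PKA_N_TERMINUS = 8.0
PKA_C_TERMINUS = 3.1


def fraction_protonated(ph, pka):
    """Fraction of isolated sites with this pKa that are protonated at this pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def estimated_charge(sequence, ph):
    """Estimated net charge, treating every titratable site as isolated."""
    counts = Counter(sequence.upper())

    # The two termini are always present, whatever the sequence.
    charge = fraction_protonated(ph, PKA_N_TERMINUS)
    charge -= 1.0 - fraction_protonated(ph, PKA_C_TERMINUS)

    # n * f for each residue type, with the sign set by which form is charged.
    for residue, pka in PKA_POSITIVE.items():
        charge += counts[residue] * fraction_protonated(ph, pka)
    for residue, pka in PKA_NEGATIVE.items():
        charge -= counts[residue] * (1.0 - fraction_protonated(ph, pka))
    return charge
```

For example, estimated_charge("MKKHDE", 7.0) returns the estimate for a made-up six-residue peptide at pH 7. Residues without a titratable side chain (such as the M here) contribute nothing.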




The Calculation Part Two





So given the ability to calculate an estimated charge for a particular pH, now we would like to calculate an estimated pI for the protein. The charge as a function of pH is a nice, continuous function. So we can trivially find the pI (the pH where the estimated charge is zero) via the bisection method.

We start by finding a pH where the protein is positively charged and a pH where the protein is negatively charged. The pI is in that range. Then we pick a pH in the middle, figure out whether the protein is positively or negatively charged there, and replace the appropriate endpoint to get a smaller range for where the pI is. Wash, rinse and repeat until the range is as small as you want. I don't push the accuracy to too many decimal places simply because the assumptions behind part one of the calculation are wrong, so getting more than a few digits of accuracy is a numerical exercise that is physically meaningless (which is why the output is truncated to tenths of a pH unit).
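The bisection loop described above might look something like the following sketch. The names, the starting range of pH 0 to 14, and the stopping tolerance are all assumptions made for illustration; the program's real values are not given here. To keep the example self-contained, the charge function is a toy one that models only the two termini (pKa 8.0 and 3.1), for which the zero crossing can be worked out by hand as (8.0 + 3.1) / 2 = 5.55.

```python
def bisect_pi(charge_at, low=0.0, high=14.0, tolerance=0.01):
    """Find the pH where charge_at(pH) crosses zero, by bisection.

    charge_at must be positive at `low` and negative at `high`.
    """
    if charge_at(low) <= 0.0 or charge_at(high) >= 0.0:
        raise ValueError("charge does not change sign over the starting range")

    while high - low > tolerance:
        middle = (low + high) / 2.0
        if charge_at(middle) > 0.0:
            low = middle    # still positive, so the pI is above the middle
        else:
            high = middle   # negative (or zero), so the pI is at or below it
    return (low + high) / 2.0


def termini_only_charge(ph):
    """Toy charge function: one amino terminus and one carboxy terminus."""
    positive = 1.0 / (1.0 + 10.0 ** (ph - 8.0))   # protonated N-terminus
    negative = 1.0 / (1.0 + 10.0 ** (3.1 - ph))   # deprotonated C-terminus
    return positive - negative


estimated_pi = bisect_pi(termini_only_charge)
```

Because the charge only ever decreases as the pH rises, there is a single zero crossing, and each pass through the loop halves the range that contains it. Going from a range of 14 pH units to one narrower than 0.01 takes 11 passes.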




The End





And that's it. The calculation that's being performed is very simple, but frankly I'm at a loss as to how to do a better job given only an arbitrary protein sequence. (Some users have wondered whether or not this program actually looks up sequences from some database. Frankly, I'm not sure there are enough measured pI's to make it worth it, and this in my mind would tremendously limit the usefulness of the calculation while adding a horrendous burden on me to try to generate and maintain such a database.)






Document version 0.1 10/22/05