Prediction of Protein Structure and Function From Sequence
J. Skolnick, A. Kolinski, J.S. Fetrow, A. Godzik, A.R. Ortiz, B. Ilkowski,* D. Mohanty, K. Pawlowski, B. Reva, P. Rotkiewicz,* L. Zhang
* University of Warsaw, Warsaw, Poland
One of the goals of genome sequencing projects is to develop tools for comparing and interpreting the resulting genomic information, in particular, for predicting protein function from sequence. To achieve this objective, we are creating a series of tools based on the sequence-to-structure-to-function model. We are developing algorithms that predict protein structure from sequence, including both ab initio folding tools and threading methods.
With our ab initio folding approaches, low-resolution structures can be predicted for a substantial fraction of small, single-domain proteins. Furthermore, we showed that such low-resolution models can be used to determine active sites in proteins, and thus we can use structural information to predict protein function. This finding suggests a means for the large-scale functional screening of genomic sequence databases that is based on the prediction of structure from sequence and then on the detection of functional active sites in the predicted structure. This situation opens up the possibility of screening entire genomes to identify proteins that have a specified biochemical activity.
PREDICTION OF THE TERTIARY STRUCTURE OF GLOBULAR PROTEINS
By incorporating predicted secondary and tertiary restraints derived from multiple sequence alignments into ab initio folding simulations, we assembled nativelike tertiary structures for a test set of 19 nonhomologous proteins, ranging from 29 to 100 residues and representing all secondary structural classes. Secondary structural restraints are provided by the PHD secondary structure prediction algorithm that incorporates multiple sequence information. Multiple sequence alignments also provide predicted tertiary restraints via a 2-step process: First, seed side-chain contacts are selected from a correlated mutation analysis, and then an inverse folding algorithm is used to expand these seed contacts. The predicted secondary and tertiary restraints are incorporated into a lattice-based, reduced protein model for structure assembly and refinement. The resulting nativelike topologies have a coordinate root-mean-square deviation (cRMSD) from native for the whole chain between 3.1 and 6.7 Å.
Blind predictions of 2 proteins were also successful. One of these, a helical protein, has a cRMSD of 5.7 Å from native; the other, the KIX domain of the CREB binding protein, has a cRMSD of 5.8 Å. Finally, for a number of these simplified protein structures, in collaboration with P. Kollman and colleagues, University of California, San Francisco, we used molecular dynamics to build and refine detailed atomic models to produce structures with cRMSDs of 2.3 Å from native.
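For readers unfamiliar with the measure, the cRMSD values quoted above can be illustrated with a minimal sketch. It assumes one representative point per residue and finds the best rigid-body superposition with Horn's quaternion formulation, using a simple power iteration for the largest eigenvalue. The function names and the iteration count are illustrative assumptions, not the code used in this work.

```python
import math

def center(coords):
    """Translate a list of (x, y, z) points so their centroid is at the origin."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in coords]

def crmsd(model, native, iterations=500):
    """Coordinate RMSD after optimal superposition (illustrative sketch)."""
    if len(model) != len(native) or len(model) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    a, b = center(model), center(native)
    n = len(a)
    ga = sum(x * x for p in a for x in p)  # sum of squared norms, model
    gb = sum(x * x for p in b for x in p)  # sum of squared norms, native
    # 3x3 correlation matrix S[i][j] = sum over residues of a_i * b_j
    s = [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)]
         for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its largest eigenvalue gives the best fit
    key = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    shift = 0.5 * (ga + gb)  # eigenvalues lie in [-shift, shift]
    if shift == 0.0:
        return 0.0
    v = [1.0, 0.7, 0.4, 0.1]  # arbitrary asymmetric starting vector
    for _ in range(iterations):
        # power iteration on the shifted (positive semidefinite) matrix
        w = [sum(key[r][c] * v[c] for c in range(4)) + shift * v[r]
             for r in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    # Rayleigh quotient approximates the largest eigenvalue of the key matrix
    lam = sum(v[r] * key[r][c] * v[c] for r in range(4) for c in range(4))
    return math.sqrt(max(0.0, (ga + gb - 2.0 * lam) / n))
```

Calling `crmsd(model, native)` with two equal-length lists of coordinates in Å returns the deviation in Å. A rotated and translated copy of a structure gives a value close to zero, because the superposition removes rigid-body differences.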
During the past year, we developed a new approach to reduced representation and Monte Carlo simulation of protein structures. It builds on the well-known fact that intraprotein interactions are rather specific for amino acid side chains and rather generic for the main chain units. Thus, the proposed lattice model of polypeptides assumes explicit representation only for the side chains, and particular side groups are represented as clusters of occupied points on the underlying simple cubic lattice. A new, knowledge-based force field has been developed for this model based on local distance geometry, statistics of side-chain contacts in known protein structures, and some multibody correlations observed in real proteins. The model is purely lattice based and, despite having the same level of resolution as more complex reduced models, allows Monte Carlo sampling that is about 100 times faster, thereby enabling the study of much larger protein systems.
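The general idea of Metropolis Monte Carlo sampling on a simple cubic lattice can be sketched as follows. This is a deliberately simplified toy and not the model described above: it assumes one bead per residue instead of side-chain clusters, and a single hypothetical contact energy in place of the knowledge-based force field. All names and parameters are illustrative.

```python
import math
import random

# The six unit steps of a simple cubic lattice
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def adjacent(p, q):
    """True if two lattice sites are nearest neighbors."""
    return sum(abs(a - b) for a, b in zip(p, q)) == 1

def energy(chain, contact_energy=-1.0):
    """Toy energy: one contact_energy per pair of non-bonded neighboring beads."""
    total = 0.0
    for i in range(len(chain)):
        for j in range(i + 2, len(chain)):
            if adjacent(chain[i], chain[j]):
                total += contact_energy
    return total

def propose(chain, rng):
    """Try an end move or corner flip of one bead; return None if invalid."""
    i = rng.randrange(len(chain))
    anchor = chain[i - 1] if i > 0 else chain[1]
    step = rng.choice(STEPS)
    new_site = tuple(a + d for a, d in zip(anchor, step))
    if new_site in chain:
        return None  # excluded volume: the site is already occupied
    if 0 < i < len(chain) - 1 and not adjacent(new_site, chain[i + 1]):
        return None  # the move would break the bond to the next bead
    trial = list(chain)
    trial[i] = new_site
    return trial

def metropolis(chain, n_steps, temperature, seed=0):
    """Run n_steps Metropolis trials; return the final chain and its energy."""
    rng = random.Random(seed)
    current = list(chain)
    e_current = energy(current)
    for _ in range(n_steps):
        trial = propose(current, rng)
        if trial is None:
            continue  # an invalid proposal counts as a rejected step
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Metropolis criterion: always accept downhill, sometimes uphill
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_trial
    return current, e_current
```

Starting from an extended chain such as `[(i, 0, 0) for i in range(8)]`, repeated trials preserve chain connectivity and excluded volume while favoring compact conformations at low temperature. Restricting beads to lattice sites keeps every move and energy evaluation to cheap integer arithmetic, which is one reason lattice models sample quickly.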
We used the new model in several applications, including self-consistent secondary structure prediction, assembly of protein structure from sparse experimental data, and structure predictions based on distant sequence similarity or sequence-structure compatibility. The last application is briefly outlined in the following section.
The key part of this method is the development of potentials specific to a given protein sequence. First, the sequence fragments most similar to particular (overlapping) sequence fragments of the protein of interest are selected from the structural library. These fragments, presumably also similar structurally, are then used to build statistical potentials that describe the secondary structure propensities of the query sequence. Long-range potentials can be derived in a somewhat similar fashion. Next, a search is made (by sequence alignment or threading) for a known structure that is homologous to, or has sequence-structure compatibility with, the query sequence. The conserved part of the resulting alignment to a template provides approximate long-range restraints for folding simulations. Using these ideas, we showed that when the homology is low, this method allows a qualitatively more accurate prediction of structure than do conventional homology modeling methods. For example, conventional automated modeling of plastocyanin with azurin as a template gives an 8-Å model, whereas our method gives a 4-Å model.
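The first step, building sequence-specific secondary structure propensities from similar library fragments, might be sketched as below. Several choices here are assumptions made only for illustration: fragments are scored by simple sequence identity, the library is a list of hypothetical (sequence, secondary structure) pairs using the states H, E, and C, and counts are converted to potentials as negative log frequencies with a pseudocount.

```python
import math
from collections import Counter

STATES = "HEC"  # helix, strand, coil (assumed three-state alphabet)

def identity(a, b):
    """Number of identical residues between two equal-length fragments."""
    return sum(x == y for x, y in zip(a, b))

def fragment_potentials(query, library, window=5, top_k=2, pseudocount=1.0):
    """Per-residue potentials from the library fragments most similar to each
    overlapping window of the query (illustrative sketch)."""
    counts = [Counter() for _ in query]
    for start in range(len(query) - window + 1):
        q_frag = query[start:start + window]
        scored = []
        for seq, ss in library:
            for off in range(len(seq) - window + 1):
                score = identity(q_frag, seq[off:off + window])
                scored.append((score, ss[off:off + window]))
        scored.sort(key=lambda item: item[0], reverse=True)
        # accumulate the secondary structure states of the best fragments
        for _, ss_frag in scored[:top_k]:
            for k, state in enumerate(ss_frag):
                counts[start + k][state] += 1
    potentials = []
    for c in counts:
        total = sum(c[s] for s in STATES) + pseudocount * len(STATES)
        # lower value means the state is more favorable at this position
        potentials.append(
            {s: -math.log((c[s] + pseudocount) / total) for s in STATES}
        )
    return potentials
```

With a toy library such as `[("AEELLKK", "HHHHHHH"), ("VTVTVTV", "EEEEEEE")]`, the query `"AEELLKK"` receives its lowest potential for the helical state at every position, because the best-matching fragments are all helical.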
The practical exploitation of the vast numbers of sequences in the genome sequence databases is crucially dependent on the ability to determine the function of each sequence. Unfortunately, current methods, including global sequence alignment and local sequence pattern identification, are limited by the extent of sequence similarity between sequences of unknown and known function. These methods increasingly fail as the sequence identity diverges into and beyond the twilight zone of sequence identity. To address this problem, we developed a novel method for determining protein function that is based directly on the sequence-to-structure-to-function model.
Descriptors of protein active sites, termed fuzzy functional forms, are created on the basis of the forms' geometry and conformation. By way of illustration, the active sites responsible for the oxidoreductase activity of the glutaredoxin/thioredoxin family and the RNA hydrolytic activity of the T1 ribonuclease family have been derived. First, the fuzzy functional forms are shown to correctly indicate the active site in a library of exact protein models produced by crystallography or nuclear magnetic resonance spectroscopy, most of which lack the specified activity. Next, these fuzzy functional forms are used to screen for active sites in low-to-moderate resolution models produced by using ab initio folding or threading prediction algorithms. Again, the fuzzy functional forms can specifically indicate the functional sites of these proteins from the predicted structures of the proteins. The results indicate that low-to-moderate resolution models as produced by state-of-the-art tertiary structure prediction algorithms are sufficient to detect protein active sites.
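As a rough illustration of how a geometric active-site descriptor can be matched against a structure, the sketch below searches for residues of specified types whose pairwise distances fall within allowed ranges. The descriptor format, the residue representation (one point per residue), and all distance values used with it are hypothetical and are not the published fuzzy functional forms.

```python
import math

def find_sites(residues, descriptor_types, distance_ranges):
    """Find residue index tuples matching a geometric descriptor.

    residues: list of (residue_name, (x, y, z)), one point per residue.
    descriptor_types: residue names required at descriptor positions 0..m-1.
    distance_ranges: {(j, k): (low, high)} with j < k, allowed distance in
        angstroms between the residues at descriptor positions j and k.
    """
    matches = []

    def extend(chosen):
        k = len(chosen)
        if k == len(descriptor_types):
            matches.append(tuple(chosen))
            return
        for idx, (name, xyz) in enumerate(residues):
            if name != descriptor_types[k] or idx in chosen:
                continue
            # check every distance constraint that ends at position k
            ok = all(
                low <= math.dist(xyz, residues[chosen[j]][1]) <= high
                for (j, k2), (low, high) in distance_ranges.items()
                if k2 == k
            )
            if ok:
                extend(chosen + [idx])

    extend([])
    return matches
```

Because the constraints are distance ranges rather than exact values, the same descriptor can tolerate the coordinate errors of a low-to-moderate resolution model. Widening the ranges makes the descriptor more permissive at the cost of more false matches.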
This automated method for screening of protein activity based on the sequence-to-structure-to-function model has been applied to the complete Escherichia coli genome. All E. coli open reading frames were screened for the thiol-disulfide oxidoreductase activity of the glutaredoxin/thioredoxin protein family. We showed that the method can be used to detect the active-site residues in 10 sequences that are known to or proposed to exhibit this activity. Furthermore, oxidoreductase activity was predicted in 2 other sequences that had not been detected previously. The method distinguishes protein pairs with similar active sites from protein pairs that are just topological cousins, that is, those having similar global folds but not necessarily similar active sites. Thus, this method provides a novel approach for collecting information on active sites and function that is based on 3-dimensional structures rather than on simple sequence analysis. Prediction of protein activity is fully automated and easily extendible to new functions.
Other activities of our research group include the development of self-consistent field approaches to threading, prediction of the quaternary structure of coiled coils, and simulations of the mechanism of assembly of viral coat protein.
PUBLICATIONS
Fetrow, J., Godzik, A., Skolnick, J. Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: Identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. J. Mol. Biol., in press.
Fetrow, J., Skolnick, J. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol. 281:949, 1998.
Hu, W.-P., Kolinski, A., Skolnick, J. An improved method for the prediction of the protein backbone U-turn positions and the major secondary structures between the U-turns. Proteins 29:443, 1997.
Kolinski, A., Galazka, W., Skolnick, J. Monte Carlo studies of the thermodynamics and kinetics of reduced protein models: Application to small helical, β, and α/β proteins. J. Chem. Phys. 108:2608, 1998.
Kolinski, A., Jaroszewski, L., Rotkiewicz, P., Skolnick, J. An efficient Monte Carlo model of protein chains: Modeling the short-range correlations between side group centers of mass. J. Chem. Phys. 102:4628, 1998.
Kolinski, A., Rotkiewicz, P., Skolnick, J. Application of a high coordination lattice model in protein structure prediction. In: Proceedings of the Workshop on Monte Carlo Approach to Biopolymers and Protein Folding. World Scientific, River Edge, NJ, in press.
Kolinski, A., Skolnick, J. Assembly of protein structure from sparse experimental data: An efficient Monte Carlo model. Proteins 32:475, 1998.
Ortiz, A., Kolinski, A., Skolnick, J. Combined multiple sequence reduced protein model approach to predict the tertiary structure of small proteins. In: Proceedings of the Pacific Symposium on Biocomputing (PSB-98). Altman, R., et al. (Eds.). World Scientific, River Edge, NJ, 1998, p. 377.
Ortiz, A., Kolinski, A., Skolnick, J. Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol. 277:419, 1998.
Ortiz, A., Kolinski, A., Skolnick, J. Nativelike topology assembly of small proteins using predicted restraints in Monte Carlo folding simulations. Proc. Natl. Acad. Sci. U.S.A. 95:1020, 1998.
Ortiz, A., Kolinski, A., Skolnick, J. Tertiary structure prediction of the KIX domain of CBP using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. Proteins 30:287, 1998.
Reva, B., Finkelstein, A., Sanner, M., Olson, A., Skolnick, J. Recognition of protein structure on coarse lattices with residue-residue energy functions. Protein Eng. 10:1123, 1997.
Reva, B., Finkelstein, A., Skolnick, J. Derivation and testing residue-residue mean force potentials for use in protein structure recognition. Methods Mol. Biol., in press.
Reva, B., Finkelstein, A., Skolnick, J. A self-consistent field optimization approach to build energetically and geometrically correct lattice models of proteins. J. Comput. Biol., in press.
Reva, B., Finkelstein, A.V., Skolnick, J. What is the probability of a chance prediction of a protein structure with an RMSD of 6 Å? Folding Design 3:141, 1998.
Sikorski, A., Kolinski, A., Skolnick, J. Computer simulations of de novo designed helical proteins. Biophys. J. 75:92, 1998.
Skolnick, J., Kolinski, A. Monte Carlo approaches to the protein folding problem. In: Monte Carlo Methods in Chemical Physics. Ferguson, D., Siepmann, J.I., Truhlar, D.G. (Eds.). Advances in Chemical Physics Series. Wiley, New York, in press.
Skolnick, J., Kolinski, A. Protein modelling. In: Encyclopedia of Computational Chemistry. Schleyer, P., Kollman, P. (Eds.). Wiley, New York, in press.
Zhang, L., Skolnick, J. How do potentials derived from structural databases relate to "true" potentials? Protein Sci. 7:112, 1998.
Zhang, L., Skolnick, J. Z-score of native protein structures. Protein Sci. 7:1201, 1998.