

* Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China;
Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115; and
Upjohn Laboratories, Pharmacia, Kalamazoo, Michigan 49007
Correspondence: Address reprint requests to Yu-Dong Cai, Biomolecular Sciences Dept., UMIST, P.O. Box 88, Manchester M60 1QD, UK. E-mail: y.cai{at}umist.ac.uk.
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
|
| PSEUDO-AMINO ACID COMPOSITION AND FUNCTIONAL DOMAIN COMPOSITION |
|---|
|
|
|---|
In the pseudo-amino acid composition, a protein is represented by 20 + λ discrete numbers (Chou, 2001), i.e., by a vector in a (20 + λ)-D space, as given by
![]() | (1) |
where the components beyond the first 20 are related to λ different ranks (Fig. 2) of sequence-order correlation factors, as formulated by the following equation (Chou, 2002):
![]() | (2) |
where θ1 is called the first-rank coupling factor that harbors the sequence-order correlation between all the most contiguous residues along a protein chain (Fig. 2 a), θ2 the second-rank coupling factor that harbors the sequence-order correlation between all the second most contiguous residues (Fig. 2 b), θ3 the third-rank coupling factor that harbors the sequence-order correlation between all the third most contiguous residues (Fig. 2 c), and so forth. The coupling factor Ji,j in Eq. 2 is a function of amino acids Ri and Rj, such as the physicochemical distance (Schneider and Wrede, 1994).
The sequence-order effect enters through the coupling factors θ1, θ2, θ3, ..., θλ, as defined by Eq. 2. Accordingly, the first 20 components of Eq. 1 reflect the effect of the amino acid composition, while the components from 20 + 1 to 20 + λ reflect the effect of sequence order. A set of such 20 + λ components as formulated by Eqs. 1–2 is called the pseudo-amino acid composition for protein P. The name is used because the set retains the main feature of the amino acid composition while also containing information beyond the conventional amino acid composition. The pseudo-amino acid composition thus defined has the following advantage: compared with the 210-D pair-coupled amino acid composition (Chou, 1999)
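The construction described above can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' method: each coupling factor is taken as the average of J over all residue pairs k positions apart (the form used in Chou, 2001), the function names and the toy coupling function `toy_J` are hypothetical, and the weighting and normalization of Eq. 1 are omitted because that equation is not recoverable from the text.

```python
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 native amino acids


def coupling_factors(seq: str, lam: int, J: Callable[[str, str], float]) -> List[float]:
    """Return [theta_1, ..., theta_lam]; theta_k averages J(R_i, R_{i+k})
    over all residue pairs that are k positions apart along the chain."""
    n = len(seq)
    if not 0 <= lam < n:
        raise ValueError("lam must satisfy 0 <= lam < len(seq)")
    return [
        sum(J(seq[i], seq[i + k]) for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]


def pseudo_composition(seq: str, lam: int, J: Callable[[str, str], float]) -> List[float]:
    """Concatenate the 20 amino acid frequencies with the lam coupling factors.
    Assumption: no weight factor or joint normalization is applied."""
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    return freqs + coupling_factors(seq, lam, J)


def toy_J(a: str, b: str) -> float:
    """Hypothetical coupling function: 0 for identical residues, 1 otherwise.
    A real application would use a physicochemical distance instead."""
    return 0.0 if a == b else 1.0


vec = pseudo_composition("AACA", 2, toy_J)  # 20 + 2 = 22 components
```

For the toy sequence, the first 20 entries of `vec` hold the composition and the last two hold θ1 and θ2; swapping `toy_J` for a property-based distance changes only the last λ entries.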
|
![]() | (3) |
![]() | (4) |
(20 + λ)-D space of the pseudo-amino acid composition (Chou, 2001)
| SUPPORT VECTOR MACHINES |
|---|
|
|
|---|
SVMs have been used to deal with protein fold recognition (Ding and Dubchak, 2001), protein-protein interaction prediction (Bock and Gough, 2001), and protein secondary structure prediction (Hua and Sun, 2001).
In this article, Vapnik's Support Vector Machine (Vapnik, 1995) was introduced to predict the types of membrane proteins. Specifically, SVMlight, an implementation (in C) of SVM for pattern recognition problems, was used for the computations. The optimization algorithm used in SVMlight can be found in Joachims (1999). The relevant mathematical principles can be briefly formulated as follows. Given a set of N samples, i.e., a series of input vectors
$\mathbf{x}_k \in \mathbb{R}^d \quad (k = 1, \ldots, N)$ | (5) |
where xk can be regarded as the kth protein or vector defined in the 2005-D space according to the functional domain composition, and R^d is a Euclidean space with d dimensions. Since the multiclass identification problem can always be converted into a two-class identification problem, without loss of generality the formulation below is given for the two-class case only. Suppose the output derived from the learning machine is expressed by hk ∈ {+1, −1} (k = 1, ..., N), where the indexes −1 and +1 stand for the two classes concerned. The goal here is to construct a binary classifier, i.e., to derive from the available samples a decision function that has a small probability of misclassifying a future sample. Both the basic linear separable case and the linear nonseparable case, which is the more useful one for most real-life problems, are considered.
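One common way to carry out the multiclass-to-two-class conversion mentioned above is one-versus-rest relabeling; the sketch below shows it together with the sparse `<target> <index>:<value>` input lines that SVMlight reads. The conversion scheme the authors actually used is not stated, so this is an assumption, as are the helper names and the toy 4-D vectors (the article works in a 2005-D space).

```python
from typing import List, Sequence


def one_vs_rest_labels(types: Sequence[str], positive: str) -> List[int]:
    """Relabel a multiclass problem as two-class: +1 for `positive`, -1 for the rest."""
    return [1 if t == positive else -1 for t in types]


def to_svmlight_line(label: int, vector: Sequence[float]) -> str:
    """Format one sample as '<target> <index>:<value> ...' with 1-based,
    increasing feature indices; zero-valued features are omitted."""
    features = " ".join(
        f"{i}:{v:g}" for i, v in enumerate(vector, start=1) if v != 0
    )
    return f"{label:+d} {features}".rstrip()


# Toy data (hypothetical): three proteins described by 4 domain features.
types = ["type I", "multipass", "type I"]
vectors = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1]]

labels = one_vs_rest_labels(types, positive="type I")
lines = [to_svmlight_line(h, x) for h, x in zip(labels, vectors)]
```

Repeating the relabeling once per membrane protein type yields one binary training file per type, each of which fits the two-class formulation that follows.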
The linear separable case
In this case, there exists a separating hyper-plane whose function is w·x + b = 0, which implies:
![]() | (6) |
, and using the Karush-Kuhn-Tucker conditions (Cristianini and Shawe-Taylor, 2000)
![]() | (7) |
![]() | (8) |
![]() | (9) |
![]() | (10) |
are these
called the support vectors. Now suppose P is a query protein defined in the same 2005-D space based on the functional domain composition. After the SVM has been trained, the decision function for identifying which class the query protein belongs to can be formulated as:
![]() | (11) |
0 or
0, respectively.
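For reference, the linear separable case in its standard textbook form (see, e.g., Cristianini and Shawe-Taylor, 2000) can be written as below. The symbols w, b, and αk are assumptions chosen to match xk and hk above, and the layout may differ from that of Eqs. 6–11.

```latex
\begin{aligned}
&\text{primal:} && \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
  \quad \text{subject to} \quad h_k\,(\mathbf{w}\cdot\mathbf{x}_k + b) \ge 1,
  \quad k = 1,\ldots,N \\
&\text{dual:} && \max_{\boldsymbol{\alpha}}\ \sum_{k=1}^{N}\alpha_k
  - \tfrac{1}{2}\sum_{j=1}^{N}\sum_{k=1}^{N}\alpha_j\alpha_k h_j h_k\,
  (\mathbf{x}_j\cdot\mathbf{x}_k)
  \quad \text{subject to} \quad \alpha_k \ge 0,\ \ \sum_{k=1}^{N}\alpha_k h_k = 0 \\
&\text{decision:} && f(\mathbf{x}) = \operatorname{sgn}\!\Big(\sum_{k=1}^{N}
  \alpha_k h_k\,(\mathbf{x}_k\cdot\mathbf{x}) + b\Big)
\end{aligned}
```

Under the Karush-Kuhn-Tucker conditions, αk[hk(w·xk + b) − 1] = 0 for every k, so only the samples lying exactly on the margin can have αk > 0; these are the support vectors, and they alone contribute to the decision function.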
The linear nonseparable case
This case requires two important techniques, which are described in turn below.
The "soft margin" technique
To allow for training errors, Cortes and Vapnik (1995) introduced the slack variables
![]() | (12) |
![]() | (13) |
![]() | (14) |
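As a rough illustration of what the slack variables measure, the sketch below evaluates the usual soft-margin objective, 0.5·||w||² + c·Σξk with ξk = max(0, 1 − hk(w·xk + b)), for a fixed hyper-plane. The linear penalty on the slacks is the standard Cortes and Vapnik form and is assumed here, since Eqs. 12–14 are not recoverable; the function name and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def soft_margin_objective(
    w: Sequence[float],
    b: float,
    X: Sequence[Sequence[float]],
    h: Sequence[int],
    c: float,
) -> Tuple[float, List[float]]:
    """Return (objective, slacks) for a fixed hyper-plane (w, b).
    A slack of 0 means the sample is on the correct side of its margin;
    a slack above 1 means the sample is misclassified."""
    slacks = [
        max(0.0, 1.0 - hk * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, hk in zip(X, h)
    ]
    objective = 0.5 * sum(wi * wi for wi in w) + c * sum(slacks)
    return objective, slacks


# Toy data: the third sample sits on the wrong side of the hyper-plane x1 = 0.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
h = [1, -1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, h, c=10.0)
```

A larger c makes each unit of slack more expensive, so training tolerates fewer margin violations at the cost of a narrower margin.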
The kernel substitution technique
The SVM performs a nonlinear mapping of the input vectors from the Euclidean space R^d into a higher-dimensional Hilbert space H, where the mapping is determined by the kernel function. Then, as in the linear separable case, it finds the optimal separating hyper-plane in the Hilbert space H, which corresponds to a nonlinear boundary in the original Euclidean space. Two typical kernel functions are listed below:
![]() | (15) |
![]() | (16) |
![]() | (17) |
![]() | (18) |
![]() | (19) |
Accordingly, the form of the decision function is given by
![]() | (20) |
For a given data set, only the kernel function and the regularization parameter c must be selected to specify the SVM.
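The kernel substitution can be sketched in a few lines of Python. The polynomial and radial basis function kernels below are the two most commonly cited choices and are assumed here, because the kernels of Eqs. 15–16 are not recoverable from the text; the function names, parameter defaults, and the toy support vectors, multipliers, and bias are all hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def polynomial_kernel(x: Vector, y: Vector, degree: int = 2) -> float:
    """K(x, y) = (x.y + 1)^degree."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** degree


def rbf_kernel(x: Vector, y: Vector, gamma: float = 0.5) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decide(
    x: Vector,
    support_vectors: Sequence[Vector],
    labels: Sequence[int],
    alphas: Sequence[float],
    b: float,
    kernel: Kernel,
) -> int:
    """Sign of sum_k alpha_k * h_k * K(x_k, x) + b; a score of exactly 0
    is assigned to class -1 in this sketch."""
    score = b + sum(
        a * h * kernel(sv, x) for sv, h, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score > 0 else -1


# Toy model: one support vector per class, equal multipliers, zero bias.
svs = [[0.0, 0.0], [2.0, 2.0]]
near_positive = decide([1.8, 2.0], svs, [-1, 1], [1.0, 1.0], 0.0, rbf_kernel)
near_negative = decide([0.1, 0.0], svs, [-1, 1], [1.0, 1.0], 0.0, rbf_kernel)
```

Only the kernel evaluations involve the input vectors, which is why the mapping into H never has to be computed explicitly.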
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
|
The demonstration was conducted by three different approaches: the resubstitution test, the jackknife test, and the independent data set test, as reported below.
Resubstitution test
The so-called resubstitution test is an examination of the self-consistency of an identification method. When the resubstitution test is performed for the current study, the type of each membrane protein in a data set is, in turn, identified using the rule parameters derived from the same data set, the so-called training data set. The success rate thus obtained for the 2059 membrane proteins is summarized in Table 1, from which we can see that the overall success rate is 93.9%, indicating that after being trained, the SVM model has grasped the complicated relationship between the functional domain composition and the types of membrane proteins. However, during the resubstitution test, the rule parameters derived from the training data set include the information of the query protein later plugged back in the test. This will certainly underestimate the error and inflate the success rate because the same proteins are used both to derive the rule parameters and to test them. Accordingly, the success rate thus obtained represents an optimistic estimate (Cai, 2001; Chou, 1995; Chou and Elrod, 1999b; Zhou and Assa-Munt, 2001). Nevertheless, the resubstitution test is absolutely necessary because it reflects the self-consistency of an identification method, especially of its algorithm part. An identification algorithm certainly cannot be deemed a good one if its self-consistency is poor. In other words, the resubstitution test is necessary but not sufficient for evaluating an identification method. As a complement, a cross-validation test on an independent testing data set is needed because it can reflect the effectiveness of an identification method in practical application. This is especially important for checking the validity of a training database, i.e., for determining whether it contains sufficient information to reflect all the important features concerned so as to yield a high success rate in application.
Jackknife test
As is well known, the independent data set test, subsampling test, and jackknife test are the three methods often used for cross-validation in statistical prediction. Among these three, however, the jackknife test is deemed the most effective and objective one; see, for example, Chou and Zhang (1995) for a comprehensive discussion, and Mardia et al. (1979) for the mathematical principle. During jackknifing, each membrane protein in the data set is in turn singled out as a tested protein and all the rule parameters are calculated based on the remaining proteins. In other words, the type of each membrane protein is identified by the rule parameters derived using all the other membrane proteins except the one being identified. During the process of jackknifing, both the training data set and the testing data set are actually open, and a protein will in turn move from one to the other. The results of the jackknife test thus obtained for the 2059 membrane proteins are also given in Table 1.
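The difference between the resubstitution and jackknife procedures can be made concrete with a short sketch. A 1-nearest-neighbor rule stands in for the SVM purely to keep the example self-contained; that substitution, the function names, and the one-dimensional toy data are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def nearest_neighbor_predict(train: List[Sample], x: Sequence[float]) -> str:
    """Return the label of the training sample closest to x (squared Euclidean distance)."""
    return min(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[1]


def resubstitution_rate(data: List[Sample]) -> float:
    """Each sample is identified by a rule built from the full data set, itself included."""
    hits = sum(nearest_neighbor_predict(data, x) == y for x, y in data)
    return hits / len(data)


def jackknife_rate(data: List[Sample]) -> float:
    """Each sample is identified by a rule built from all the other samples."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # leave the tested sample out
        hits += nearest_neighbor_predict(rest, x) == y
    return hits / len(data)


# Toy data: the last sample lies closer to the other class than to its own.
data = [([0.0], "I"), ([0.1], "I"), ([1.0], "II"), ([1.1], "II"), ([0.6], "I")]
resub = resubstitution_rate(data)
jack = jackknife_rate(data)
```

With this toy rule every sample is its own nearest neighbor, so the resubstitution rate is trivially perfect, while the jackknife rate drops because the atypical sample is misidentified once it is left out. This is the optimistic bias discussed above.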
Independent data set test
Moreover, as a demonstration of practical application, predictions were also conducted for the 2625 independent membrane proteins based on the rule parameters derived from the 2059 proteins in the training data set. The 2625 independent proteins were also taken from Chou and Elrod (1999a), of which 478 are type I transmembrane proteins, 180 type II transmembrane proteins, 1867 multipass transmembrane proteins, 14 lipid-chain anchored membrane proteins, and 86 GPI anchored membrane proteins. The predicted results thus obtained are also given in Table 1.
From Table 1 the following can be observed. 1), The success rates by both the functional domain composition approach and the pseudo-amino acid composition approach are significantly higher than those by the other approaches. This is fully consistent with expectation because both approaches bear some sequence-order effects, although by different avenues. 2), A comparison between the functional domain composition approach and the pseudo-amino acid composition approach indicates that the success rates by the former are ∼3–6% higher than those by the latter in the self-consistency test and jackknife test, indicating that the current functional domain composition approach is very promising with a high potential for further development. However, it had a remarkable setback in predicting the 2625 independent proteins: the success rate is 20% lower than that by the pseudo-amino acid composition approach. The setback may arise because the functional domain database used in the current study is still far from complete. Accordingly, many of the 2625 independent proteins cannot be effectively defined based on the current limited functional domain database. It is anticipated that with the continuous improvement of the functional domain database, the setback will be naturally overcome. 3), The goal of this study is not to determine the possible upper limit of the success rate for membrane protein type predictions, but to propose a novel and different approach to incorporate the sequence-order effect, because this is both vitally important and notoriously difficult in this area, and so far only the pseudo-amino acid composition approach (Chou, 2001) has proved really useful and been widely applied in various sequence-based (both protein and DNA) prediction projects. Also, it is premature to construct a complete or quasi-complete training data set based on the protein sequences available so far. Without a complete or quasi-complete training data set, any attempt to determine such an upper limit would be unjustified, and the result thus obtained might be misleading no matter how powerful the prediction algorithm is.
| CONCLUSION |
|---|
|
|
|---|
(20 + λ)-D pseudo-amino acid composition space (Chou, 2001)

Submitted on July 16, 2002; accepted for publication December 20, 2002.
| REFERENCES |
|---|
|
|
|---|
Bock, J. R., and D. A. Gough. 2001. Predicting protein-protein interactions from primary structure. Bioinformatics. 17:455–460.
Cai, Y. D. 2001. Is it a paradox or misinterpretation? Protein Struct. Funct. Genet. 43:336–338.
Casey, P. J. 1995. Protein lipidation in cell signalling. Science. 268:221–225.
Cedano, J., P. Aloy, J. A. Pérez-Pons, and E. Querol. 1997. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266:594–600.
Chou, K. C. 1995. A novel approach to predicting protein structural classes in a (20−1)-D amino acid composition space. Protein Struct. Funct. Genet. 21:319–344.
Chou, K. C. 1999. Using pair-coupled amino acid composition to predict protein secondary structure content. J. Protein Chem. 18:473–480.
Chou, K. C. 2000. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 278:477–483.
Chou, K. C. 2001. Prediction of protein cellular attributes using pseudo-amino-acid-composition. Protein Struct. Funct. Genet. 43:246–255.
Chou, K. C. 2002. A new branch of proteomics: prediction of protein cellular attributes. In Gene Cloning and Expression Technologies. P. W. Weinrer, and Q. Lu, editors. Eaton Publishing, Westborough, MA. pp. 57–70.
Chou, K. C., and D. W. Elrod. 1999a. Prediction of membrane protein types and subcellular locations. Protein Struct. Funct. Genet. 34:137–153.
Chou, K. C., and D. W. Elrod. 1999b. Protein subcellular location prediction. Protein Eng. 12:107–118.
Chou, K. C., W. Liu, G. M. Maggiora, and C. T. Zhang. 1998. Prediction and classification of domain structural classes. Protein Struct. Funct. Genet. 31:97–103.
Chou, K. C., and C. T. Zhang. 1994. Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem. 269:22014–22020.
Chou, K. C., and C. T. Zhang. 1995. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30:275–349. (Review.)
Chou, P. Y. 1980. Amino acid composition of four classes of proteins. Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Chou, P. Y. 1989. Prediction of protein structural classes from amino acid composition. In Prediction of Protein Structure and The Principles of Protein Conformation. G. D. Fasman, editor. Plenum Press, New York. pp. 549–586.
Cortes, C., and V. Vapnik. 1995. Support vector networks. Machine Learning. 20:273–293.
Cristianini, N., and J. Shawe-Taylor. 2000. Support Vector Machines. Cambridge University Press, Cambridge.
Ding, C. H., and I. Dubchak. 2001. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. 17:349–358.
Hua, S. J., and Z. R. Sun. 2001. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol. 308:397–407.
Joachims, T. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods–Support Vector Learning. B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. MIT Press. pp. 169–184.
Karush, W. 1939. Minima of functions of several variables with inequalities as side constraints. University of Chicago, Chicago, IL. (M.Sc. thesis.)
Liu, W., and K. C. Chou. 1998. Prediction of protein structural classes by modified Mahalanobis discriminant algorithm. J. Protein Chem. 17:209–217.
Liu, W., and K. C. Chou. 1999. Protein secondary structural content prediction. Protein Eng. 12:1041–1050.
Mardia, K. V., J. T. Kent, and J. M. Bibby. 1979. Multivariate Analysis. Academic Press, London. pp. 322, 381.
Murvai, J., K. Vlahovicek, E. Barta, and S. Pongor. 2001. The SBASE protein domain library, release 8.0: a collection of annotated protein sequence segments. Nucleic Acids Res. 29:58–60.
Nakashima, H., K. Nishikawa, and T. Ooi. 1986. The folding type of a protein is relevant to the amino acid composition. J. Biochem. 99:152–162.
Reinhardt, A., and T. Hubbard. 1998. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26:2230–2236.
Resh, M. D. 1994. Myristylation and palmitylation of Src family members: the fats of the matter. Cell. 76:411–413.
Rost, B., R. Casadio, P. Fariselli, and C. Sander. 1995. Transmembrane helices predicted at 95% accuracy. Protein Sci. 4:521–533.
Schneider, G., and P. Wrede. 1994. The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys. J. 66:335–344.
Vapnik, V. 1998. Statistical Learning Theory. Wiley-Interscience, New York.
Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
Wolfe, P. 1961. A duality theorem for nonlinear programming. Quart. Applied Math. 19:239–244.
Zhou, G. P. 1998. An intriguing controversy over protein structural class prediction. J. Protein Chem. 17:729–738.
Zhou, G. P., and N. Assa-Munt. 2001. Some insights into protein structural class prediction. Protein Struct. Funct. Genet. 44:57–59.