Using multitask classification methods to investigate the kinase-specific phosphorylation sites
© Gao et al; licensee BioMed Central Ltd. 2012
Published: 21 June 2012
Identification of phosphorylation sites by computational methods is becoming increasingly important because it reduces labor-intensive and costly experiments and can improve our understanding of the common properties and underlying mechanisms of protein phosphorylation.
This study presents a multitask learning framework that learns four kinase families of phosphorylation sites simultaneously, instead of studying each kinase family separately. The framework includes two multitask classification methods: the Multi-Task Least Squares Support Vector Machines (MTLS-SVMs) and the Multi-Task Feature Selection (MT-Feat3).
Using the multitask learning framework, we successfully identify 18 common features shared by four kinase families of phosphorylation sites. The reliability of selected features is demonstrated by the consistent performance in two multi-task learning methods.
The selected features can be used to build efficient multitask classifiers with good performance, suggesting they are important to protein phosphorylation across 4 kinase families.
Protein phosphorylation, one of the most important forms of post-translational modification of proteins, occurs on several different types of amino acid substrates. Serine (S) phosphorylation is the most common, followed by threonine (T) and tyrosine (Y). Histidine and aspartate phosphorylation may also occur, mostly in prokaryotes as part of two-component signal transduction systems and, rarely, in some eukaryotic signal transduction pathways.
Protein kinases, which catalyze phosphorylation, play critical roles in the regulation of the majority of cellular pathways, including metabolism, signal transduction, transcription, translation, cell growth, and cell differentiation. Protein kinases account for approximately 2% of known human proteins, but they are responsible for phosphorylating approximately 30% of known human proteins. Moreover, nearly half of human kinases are located in disease loci (such as those for asthma and autoimmunity) or cancer amplicons. Protein kinases are often classified into several categories based on their substrate specificity. Serine/threonine (S/T) kinases, the most common category, are further classified into a number of kinase families, including cyclin-dependent kinase (CDK), casein kinase 2 (CK2), protein kinase A (PKA), and protein kinase C (PKC).
In recent years, identification of phosphorylation sites by computational methods has become increasingly important because of the growing gap between the number of known protein sequences and the number of proteins with annotated phosphorylation sites. The gap persists because high-throughput experimental methods for identifying phosphorylation sites are still lacking and current technologies are labor-intensive and costly. Besides predicting phosphorylation sites, computational approaches can also be used to discover the common and specific features of different kinase groups.
A large number of computational tools for predicting phosphorylation sites have been reported. These methods can be roughly grouped into two categories: kinase-specific predictors (e.g. Scansite, PredPhospho, PHOSITE, NetPhosK, GPS, KinasePhos, PPSP) and non-specific predictors (e.g. NetPhos, DISPHOS). Given a protein sequence, the non-specific methods can only predict whether a candidate site is a phosphorylation site, while kinase-specific methods can also assign a predicted site to a specific kinase or kinase family. Recently Ji et al. assessed 15 predictors and combined them to build a meta-predictor named MetaPred. The performance of MetaPred exceeded that of all 15 member predictors in predicting kinase-specific phosphorylation sites across 4 kinase families. Like all meta-predictors, however, MetaPred depends on the performance of its member predictors. Moreover, it is impossible to evaluate the importance of individual features, since different member predictors use different sets of features.
All current kinase-specific phosphorylation prediction methods are single-task learning (STL) methods because they are trained independently of each other. Such methods are optimized on individual training datasets, so the commonalities between different datasets are not considered. In this study, we use multi-task learning (MTL) methods, instead of the STL methods used in previous studies, to investigate kinase-specific phosphorylation sites by learning all single tasks (STs) simultaneously. Using a shared representation, MTL learns all participating STs of a problem by global optimization, based on an intuitive idea: the common knowledge shared by related STs in a specific domain helps improve performance. It has been empirically and theoretically demonstrated that MTL can improve learning performance compared to learning STs separately. In addition, MTL can be used to find the common knowledge and perform feature selection to identify significant features shared by member STs. MTL is particularly suitable for learning many STs with scarce data, which is currently considered a major problem in the bioinformatics field. Recently, MTL has been successfully applied to several biological problems, such as gene expression analysis, subcellular location of proteins, and prediction of siRNA efficacy.
In this study, we apply two MTL methods, namely Multi-Task Least Squares Support Vector Machines (MTLS-SVMs) and Multi-Task Feature Selection (MT-Feat3), to phosphorylation site data for 4 kinase families, using datasets collected by Ji et al. MT-Feat3 is used to efficiently select features, and MTLS-SVMs are then used to build classifiers that are evaluated by cross-validation.
As a result, we identify 18 non-redundant common features, which are deemed important to protein phosphorylation across the 4 kinase families. Compared to the initial set of 560 features, the number of features used in the new predictor is reduced by more than 96% without degrading performance. Based on these selected features, future work can aim to reveal common mechanisms of phosphorylation by different kinase groups.
The MetaPS06 dataset used in this study was downloaded. It consists of datasets for 4 kinase families: CDK, CK2, PKA, and PKC. For each kinase family dataset, positive samples are experimentally identified phosphorylation sites that belong to that family, while negative samples are non-phosphorylation sites or phosphorylation sites belonging to other families. Furthermore, sites phosphorylated by multiple kinases were excluded from all datasets. The numbers of positives/negatives in the final kinase family datasets are 294/441 (CDK), 229/343 (CK2), 360/540 (PKA), and 348/522 (PKC).
Feature extraction and peptide encoding
In this study, we use 560 features (physicochemical properties) of the twenty amino acid residues. Among them, 544 features were obtained from the AAindex database and the remaining 16 features were collected from the published literature. All features are normalized to a range from 0 to 1.
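The normalization step can be illustrated with a short sketch. It assumes plain min-max scaling of each property over the residues, which is one common way to reach a 0 to 1 range; the text does not say which scaling was actually used, and the property values and the name `min_max_normalize` below are illustrative only.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant property carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative property values for three residues (not taken from the study).
raw_property = {"A": 1.8, "S": -0.8, "T": -0.7}
residues = list(raw_property)
scaled = dict(zip(residues, min_max_normalize([raw_property[r] for r in residues])))
```

After scaling, the largest value maps to 1 and the smallest to 0, so properties measured in different units become comparable.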
A fixed-length window is applied to scan a peptide sequence. The window size is optimized over the odd numbers from 3 to 21. For each feature, the average over all amino acids in the window is assigned to the middle amino acid of the window. Thus the $i$th peptide is represented by $N$ features in the form $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$, where $N$ is 560.
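The window encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `encode_site`, the toy feature table, and the choice to truncate windows at the sequence ends (rather than pad them) are all assumptions.

```python
def encode_site(sequence, center, window, feature_table):
    """Average each residue property over a window centred on `center`.

    sequence      : protein sequence as a string of one-letter codes
    center        : 0-based index of the candidate site
    window        : odd window size (3 to 21 in the text)
    feature_table : dict mapping residue -> list of N normalized properties
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    # Assumption: windows are truncated at the sequence ends, not padded.
    start = max(0, center - half)
    stop = min(len(sequence), center + half + 1)
    residues = sequence[start:stop]
    n_features = len(next(iter(feature_table.values())))
    totals = [0.0] * n_features
    for aa in residues:
        for j, value in enumerate(feature_table[aa]):
            totals[j] += value
    return [t / len(residues) for t in totals]


# Toy table with N = 2 hypothetical properties per residue.
toy_table = {"A": [0.0, 1.0], "S": [1.0, 0.0], "K": [0.5, 0.5]}
vector = encode_site("AKSKA", center=2, window=3, feature_table=toy_table)
edge_vector = encode_site("AKSKA", center=0, window=3, feature_table=toy_table)
```

With the real 560-property table, each candidate site would yield a vector of length 560 regardless of the window size, which is what lets different window sizes be compared with the same classifier.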
SVMs, RF and LS-SVMs
Support vector machines (SVMs) construct an optimized separating hyperplane by maximizing the margin between classes. Optimizing an SVM classifier involves selecting the kernel and tuning the kernel's parameters and the soft margin parameter C.
Random Forest (RF) is an ensemble machine learning method that uses many independent decision trees to perform classification or regression. Each member tree is built on a bootstrap sample of the training data using a random subset of the available variables.
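Least squares SVMs (LS-SVMs) replace the inequality constraints of SVMs with equality constraints and solve the following optimization problem:

$$\min_{w,\,b,\,e} \; \frac{1}{2} w^{\top} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2} \quad \text{s.t.} \quad y_i \left[ w^{\top} \phi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N$$

where $x_i$ is the $i$th sample, $y_i$ is its corresponding label, $N$ is the number of samples, $e_i$ is the error, $w$ is the vector of weights, $\phi(\cdot)$ is the non-linear mapping function, and $\gamma$ and $b$ are parameters to be fitted.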
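For the LS-SVM classifier whose symbols are defined above, training reduces to solving one linear system in the bias $b$ and the dual variables $\alpha_i$, rather than a quadratic program. The sketch below shows this on toy data. It is an illustration under assumptions, not the implementation used in the study: it uses a linear kernel, a hypothetical value of $\gamma$, and hand-written Gaussian elimination to stay dependency-free.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_ls_svm(X, y, gamma=10.0, kernel=linear_kernel):
    """Return (b, alpha) from the LS-SVM optimality conditions.

    System solved: [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1],
    with Omega[i][j] = y_i * y_j * K(x_i, x_j).
    """
    n = len(X)
    A = [[0.0] + [float(v) for v in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * kernel(X[i], X[j])
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    sol = solve_linear_system(A, [0.0] + [1.0] * n)
    return sol[0], sol[1:]


def predict(x, X, y, b, alpha, kernel=linear_kernel):
    """Classify x by the sign of sum_i alpha_i * y_i * K(x_i, x) + b."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data (illustrative only).
X_toy = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y_toy = [-1, -1, 1, 1]
b_toy, alpha_toy = train_ls_svm(X_toy, y_toy, gamma=10.0)
train_predictions = [predict(x, X_toy, y_toy, b_toy, alpha_toy) for x in X_toy]
```

Because every sample receives a non-zero $\alpha_i$, LS-SVMs give up the sparsity of standard SVMs in exchange for this simpler training step.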
In the MTLS-SVMs formulation, $T$ is the number of tasks, $N_t$ is the number of samples in the $t$th task, $w_0$ is the vector of common weights shared by the $T$ single tasks, $v_t$ is the vector of weights for the $t$th task, $x_{ti}$ is the $i$th sample of the $t$th task, $y_{ti}$ is its corresponding label, $\phi(\cdot)$ is the non-linear mapping function, and $\lambda$, $\gamma$ and $b$ are parameters to be fitted.
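The MTLS-SVMs objective itself did not survive in the text above. As a hedged reconstruction, one form commonly given for multi-task LS-SVMs is shown below. The symbol names $w_0$ (common weights) and $v_t$ (task-specific weights), the per-task bias $b_t$, the error terms $e_{ti}$, and the $\lambda / (2T)$ scaling are assumptions chosen to match the listed definitions and the single-task LS-SVM; the exact constants in the original may differ.

```latex
\min_{w_0,\, v_t,\, b_t,\, e_{ti}} \;
  \frac{1}{2} w_0^{\top} w_0
  + \frac{\lambda}{2T} \sum_{t=1}^{T} v_t^{\top} v_t
  + \frac{\gamma}{2} \sum_{t=1}^{T} \sum_{i=1}^{N_t} e_{ti}^{2}
\quad \text{s.t.} \quad
  y_{ti} \left[ (w_0 + v_t)^{\top} \phi(x_{ti}) + b_t \right] = 1 - e_{ti},
\quad t = 1, \ldots, T, \;\; i = 1, \ldots, N_t
```

In this form, a large $\lambda$ pushes every $v_t$ toward zero so that all tasks share the single model $w_0$, while a small $\lambda$ lets each task deviate freely and approaches independent single-task training.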
The MT-Feats (Multi-Task Feature Learning and Selection) algorithm was derived from an MTL framework designed to learn a sparse representation shared across STs from the training data. MT-Feats originally includes two algorithms for solving regression problems: the first was developed for feature learning and the second for feature selection.
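The text does not reproduce the objective behind the feature selection algorithm. Following the convex multi-task feature learning work listed in the references (Argyriou, Evgeniou and Pontil, 2008), the feature selection variant is usually written as a loss term plus a squared (2,1)-norm penalty on the weight matrix. The notation below is an assumption for illustration: $A$ is the $d \times T$ weight matrix, $a_t$ its $t$th column (the weights of task $t$), $a^{j}$ its $j$th row (the weights of feature $j$ across tasks), $L$ a generic loss, and $d$ the number of features.

```latex
\min_{A \in \mathbb{R}^{d \times T}} \;
  \sum_{t=1}^{T} \sum_{i=1}^{N_t}
    L\!\left( y_{ti}, \langle a_t, x_{ti} \rangle \right)
  + \gamma \, \| A \|_{2,1}^{2},
\qquad
\| A \|_{2,1} = \sum_{j=1}^{d} \| a^{j} \|_{2}
```

The (2,1)-norm sums the 2-norms of the rows, so it favors solutions in which whole rows are zero. A feature is therefore either used by the tasks jointly or discarded for all of them, which is what makes the selected features common across tasks.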
Performance is measured by the average accuracy (aveAc) over all the STs:

$$\mathrm{aveAc} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP and TN denote the total numbers of correctly classified positive and negative samples across all the STs, and FP and FN denote the total numbers of samples incorrectly classified as positive and as negative across all the STs. Since the datasets are relatively balanced, the average accuracy is sufficient to measure the performance of the various predictors.
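The pooled accuracy described above can be sketched in a few lines. The counts are made up for illustration, and `average_accuracy` is a hypothetical helper name; the sketch assumes the counts are pooled over tasks before the ratio is taken, as the wording "total number ... across all the STs" suggests.

```python
def average_accuracy(per_task_counts):
    """Pool confusion-matrix counts over all tasks and return the accuracy.

    per_task_counts : list of (TP, TN, FP, FN) tuples, one per single task
    """
    tp = sum(c[0] for c in per_task_counts)
    tn = sum(c[1] for c in per_task_counts)
    fp = sum(c[2] for c in per_task_counts)
    fn = sum(c[3] for c in per_task_counts)
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up counts for two tasks, for illustration only.
counts = [(80, 120, 30, 20), (60, 90, 10, 40)]
ave_ac = average_accuracy(counts)
```

Pooling weights each task by its sample count, so larger datasets such as PKA contribute more to the score than smaller ones such as CK2.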
Classification of family-specific phosphorylation sites by two MTL methods
[Table: Average classification accuracy of different classifiers with 560 features]
[Table 2: Optimized window sizes for the 4 kinase families]
[Figure: Classification accuracy of different classifiers with 560 features for the 4 kinase datasets; panels: CDK, CK2, PKA, and PKC kinase families]
Feature selection and validation
Feature selection can improve classifiers not only by making them faster and more effective but also by providing a better understanding of the relevant biological processes. MT-Feat3 is capable of selecting common features across multiple tasks in addition to performing classification. We first construct a weight matrix $W$ of dimension 560 × 4 that represents the significance of the 560 features across the 4 kinase family datasets, using a uniform window size of 7. MT-Feat3 can significantly reduce the feature dimension by eliminating rows with zero weights. We then compute the 2-norm of each non-zero row of $W$ to obtain the significance $\|w_i\|_2$, which represents the importance of the $i$th feature among the 4 kinase family datasets. All features with $\|w_i\|_2$ larger than zero are considered significant common features and are sorted by importance accordingly. In addition, the same feature selection procedure is conducted using the optimized window sizes for the 4 kinase family datasets (Table 2).
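The ranking step described above can be sketched as follows. The weight matrix here is a toy stand-in for the 560 × 4 matrix produced by MT-Feat3, and the function and feature names are hypothetical.

```python
import math


def rank_common_features(W, names):
    """Rank features by the 2-norm of their row in the weight matrix W.

    W     : list of rows, one per feature, each with one weight per task
    names : feature names, in the same order as the rows of W
    Rows whose norm is zero are dropped, mirroring the elimination of
    zero-weight rows described in the text.
    """
    scored = []
    for name, row in zip(names, W):
        norm = math.sqrt(sum(w * w for w in row))
        if norm > 0.0:
            scored.append((name, norm))
    # Most important feature first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3 x 4 weight matrix (3 hypothetical features, 4 kinase families).
W_toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.3, -0.4, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
ranking = rank_common_features(W_toy, ["f1", "f2", "f3"])
```

Using the row norm rather than any single column means a feature ranks highly when it carries weight across the kinase families taken together, not just in one of them.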
[Table: Classification accuracy of different classifiers with selected features]
Analysis of selected features
[Table: Selected features by MT-Feat3. Surviving column headings: Subset 1 (20 features), Subset 2 (26 features). Surviving entries: backbone electrostatic interactions; apparent partition energies; fractional occurrence in left helix regions; side chain conformation; others]
The best aveAc of MTLS-SVMs with subset 4 is 0.792, very close to that of MTLS-SVMs with all features (0.7936). The best aveAc of MTLS-SVMs with subset 3 is 0.7455, which is slightly poorer than that of MTLS-SVMs with all features (0.7595) (Table 4). Therefore, the 18 features in subset 4 are considered significant properties related to protein phosphorylation.
In this study, we use a multi-task learning framework to investigate phosphorylation sites across 4 kinase family datasets. In this framework, MT-Feat3 is used to select common features, which are then validated by MTLS-SVMs classifiers. The selected features are further reduced to 18 after eliminating features that have high correlation coefficients with other features. These features are considered important common features for further analysis of possible properties and mechanisms of protein phosphorylation.
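The redundancy elimination mentioned above is not specified in detail in the text. A minimal sketch of one way to do it is given below, assuming a greedy filter over importance-ranked features, Pearson correlation, and a hypothetical cutoff of 0.9; the actual threshold and procedure used in the study are not stated.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def drop_correlated(ranked_names, columns, threshold=0.9):
    """Greedy filter over features ordered most important first.

    A feature is kept only if its absolute correlation with every
    already-kept feature is below `threshold` (an assumed cutoff).
    """
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy feature columns (illustrative values only).
columns = {
    "f1": [0.1, 0.4, 0.5, 0.9],
    "f2": [0.2, 0.8, 1.0, 1.8],  # proportional to f1, so highly correlated
    "f3": [0.9, 0.1, 0.8, 0.2],
}
selected = drop_correlated(["f1", "f2", "f3"], columns)
```

Processing features in order of importance means that, within a correlated group, the highest-ranked feature is the one that survives.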
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
We thank Dr. Andreas Argyriou for his helpful discussion. This work was supported in part by the National Institutes of Health (NIH) Grant P01 AG12993 (PI: E. Michaelis).
- Stock AM, Robinson VL, Goudreau PN: Two-component signal transduction. Annu Rev Biochem 2000, 69: 183–215. 10.1146/annurev.biochem.69.1.183
- Thomason P, Kay R: Eukaryotic signal transduction via histidine-aspartate phosphorelay. J Cell Sci 2000, 113(18): 3141–3150.
- Wan J, Kang SL, Tang CN, Yan JH, Ren YL, Liu J, Gao XL, Banerjee A, Ellis LBM, Li TB: Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic Acids Research 2008, 36(4): e22.
- Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science 2002, 298(5600): 1912–1934. 10.1126/science.1075762
- Xue Y, Gao XJ, Cao J, Liu ZX, Jin CJ, Wen LP, Yao XB, Ren JA: A Summary of Computational Resources for Protein Phosphorylation. Curr Protein Pept Sci 2010, 11(6): 485–496. 10.2174/138920310791824138
- Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Research 2003, 31(13): 3635–3641. 10.1093/nar/gkg584
- Kim JH, Lee J, Oh B, Kimm K, Koh IS: Prediction of phosphorylation sites using SVMs. Bioinformatics 2004, 20(17): 3179–3184. 10.1093/bioinformatics/bth382
- Koenig M, Grabe N: Highly specific prediction of phosphorylation sites in proteins. Bioinformatics 2004, 20(18): 3620–3627. 10.1093/bioinformatics/bth455
- Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004, 4(6): 1633–1649. 10.1002/pmic.200300771
- Zhou FF, Xue Y, Chen GL, Yao XB: GPS: a novel group-based phosphorylation predicting and scoring method. Biochemical and Biophysical Research Communications 2004, 325(4): 1443–1448. 10.1016/j.bbrc.2004.11.001
- Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Research 2005, 33: W226-W229. 10.1093/nar/gki471
- Xue Y, Li A, Wang LR, Feng HQ, Yao XB: PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 2006, 7: 163. 10.1186/1471-2105-7-163
- Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. Journal of Molecular Biology 1999, 294(5): 1351–1362. 10.1006/jmbi.1999.3310
- Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK: The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Research 2004, 32(3): 1037–1049. 10.1093/nar/gkh253
- Caruana R: Multitask Learning: A Knowledge-Based Source of Inductive Bias. Proceedings of the 10th International Conference on Machine Learning 1993, 41–48.
- Argyriou A, Evgeniou T, Pontil M: Multi-Task Feature Learning. NIPS 2006.
- Argyriou A, Micchelli CA, Pontil M, Ying Y: A Spectral Regularization Framework for Multi-Task Structure Learning. NIPS 2007.
- Zhang K, Gray JW, Parvin B: Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics 2010, 26(12): i97-i105. 10.1093/bioinformatics/btq181
- Xu Q, Pan SJ, Xue HH, Yang Q: Multitask Learning for Protein Subcellular Location Prediction. IEEE/ACM Trans Comput Biol Bioinform 2011, 8: 748–759.
- Liu Q, Xu Q, Zheng VW, Xue H, Cao ZW, Yang Q: Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study. BMC Bioinformatics 2010, 11: 181. 10.1186/1471-2105-11-181
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Research 2008, 36: D202–205. 10.1093/nar/gkn255
- van Gestel T, Suykens JAK, Baesens B, Viaene S, Vanthienen J, Dedene G, de Moor B, Vandewalle J: Benchmarking least squares support vector machine classifiers. Mach Learn 2004, 54(1): 5–32.
- Argyriou A, Evgeniou T, Pontil M: Convex multi-task feature learning. Mach Learn 2008, 73(3): 243–272. 10.1007/s10994-007-5040-8
- Mead A: Review of the Development of Multidimensional-Scaling Methods. Statistician 1992, 41(1): 27–39. 10.2307/2348634
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.