Using multitask classification methods to investigate the kinase-specific phosphorylation sites
Proteome Science volume 10, Article number: S7 (2012)
Identification of phosphorylation sites by computational methods is becoming increasingly important because it reduces labor-intensive and costly experiments and can improve our understanding of the common properties and underlying mechanisms of protein phosphorylation.
A multitask learning framework for learning four kinase families simultaneously, instead of studying each kinase family of phosphorylation sites separately, is presented in the study. The framework includes two multitask classification methods: the Multi-Task Least Squares Support Vector Machines (MTLS-SVMs) and the Multi-Task Feature Selection (MT-Feat3).
Using the multitask learning framework, we successfully identify 18 common features shared by four kinase families of phosphorylation sites. The reliability of selected features is demonstrated by the consistent performance in two multi-task learning methods.
The selected features can be used to build efficient multitask classifiers with good performance, suggesting they are important to protein phosphorylation across 4 kinase families.
Protein phosphorylation, one of the most important forms of post-translational modification of proteins, occurs on several different types of amino acid substrates. Serine (S) phosphorylation is the most common, followed by threonine (T) and tyrosine (Y). Histidine and aspartate phosphorylation may also occur, mostly in prokaryotes as part of two-component signal transduction systems and, rarely, in some eukaryotic signal transduction pathways.
Protein kinases, which catalyze phosphorylation, play critical roles in the regulation of the majority of cellular pathways, including metabolism, signal transduction, transcription, translation, cell growth, and cell differentiation. Protein kinases account for approximately 2% of known human proteins, but they are responsible for phosphorylating approximately 30% of known human proteins. Moreover, nearly half of human kinases are located in disease loci (such as asthma and autoimmunity) or cancer amplicons. Protein kinases are often classified into several categories based on their substrate specificity. Serine/threonine (S/T) kinases, the most common category, are further classified into a number of kinase families, including cyclin-dependent kinase (CDK), casein kinase 2 (CK2), protein kinase A (PKA), and protein kinase C (PKC).
In recent years, identifying phosphorylation sites by computational methods has become increasingly important because the gap between the number of known protein sequences and the number of annotated phosphorylation sites in those sequences keeps growing. The gap persists because high-throughput experimental methods for identifying phosphorylation sites are still lacking and current technologies are labor-intensive and costly. Besides predicting phosphorylation sites, computational approaches can also be used to discover the common and specific features of different kinase groups.
A large number of computational tools for predicting phosphorylation sites have been reported. These methods can be roughly grouped into two categories: kinase-specific predictors (e.g. Scansite, PredPhospho, PHOSITE, NetPhosK, GPS, KinasePhos, PPSP) and non-specific predictors (e.g. NetPhos, DISPHOS). Given a protein sequence, the non-specific methods can only predict whether a candidate site is a phosphorylation site, while kinase-specific methods can also assign a predicted site to a specific kinase or kinase family. Recently Ji et al. assessed 15 predictors and combined them to build a meta-predictor named MetaPred. MetaPred outperformed all 15 member predictors in predicting kinase-specific phosphorylation sites across 4 kinase families. Like all meta-predictors, however, the performance of MetaPred depends on its member primary predictors. Moreover, it is impossible to evaluate the importance of individual features, since different primary predictors use different sets of features.
All current kinase-specific phosphorylation prediction methods are single-task learning (STL) methods because their predictors are trained independently of each other. Such methods are optimized on individual training datasets, so the commonalities between different datasets are not considered. In this study, we use multi-task learning (MTL) methods, instead of the STL methods of previous studies, to investigate kinase-specific phosphorylation sites by learning all single tasks (STs) simultaneously. Using a shared representation, MTL learns all participating STs of a problem by a global optimization approach based on an intuitive idea: the common knowledge shared by related STs in a specific domain helps improve performance. It has been empirically and theoretically demonstrated that MTL can improve learning performance compared to learning STs separately. In addition, MTL can be used to find the common knowledge and perform feature selection to identify significant features shared by member STs. MTL is particularly suitable for learning many STs with scarce data, which is currently considered a major problem in the bioinformatics field. Recently, MTL has been successfully applied to several biological problems, such as gene expression analysis, subcellular location of proteins, and prediction of siRNA efficacy.
In this study, we apply two MTL methods, namely the Multi-Task Least Squares Support Vector Machines (MTLS-SVMs) and the Multi-Task Feature Selection (MT-Feat3), to the phosphorylation site data of 4 kinase families, using the datasets collected by Ji et al. MT-Feat3 is used to select features efficiently, and MTLS-SVMs is then used to build classifiers that are evaluated by cross validation.
As a result, we identify 18 non-redundant common features, which are deemed important to protein phosphorylation across 4 kinase families. Compared to the initial set of 560 features, the number of features used in the new predictor is reduced by more than 96% without deteriorating the performance. Based on these selected features, future work can be done to reveal common mechanisms of phosphorylation by different kinase groups.
The MetaPS06 dataset used in this study was downloaded as published by Ji et al. It consists of 4 kinase family datasets: CDK, CK2, PKA, and PKC. For each kinase family dataset, positive samples are experimentally identified phosphorylation sites that belong to that family, while negative samples are non-phosphorylation sites or phosphorylation sites belonging to other families. Furthermore, sites phosphorylated by multiple kinases were excluded from all datasets. The numbers of positives/negatives in the final kinase family datasets are 294/441 (CDK), 229/343 (CK2), 360/540 (PKA), and 348/522 (PKC).
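The class balance implied by these counts can be checked with a few lines of Python. This is only an arithmetic sketch over the numbers quoted above; the variable and function names are illustrative and not part of the original work.

```python
# Positive/negative counts per kinase family, as reported in the text.
counts = {
    "CDK": (294, 441),
    "CK2": (229, 343),
    "PKA": (360, 540),
    "PKC": (348, 522),
}


def positive_fraction(pos, neg):
    """Fraction of samples in one dataset that are positives."""
    return pos / (pos + neg)


# Per-family positive fraction and the overall number of samples.
fractions = {family: positive_fraction(p, n) for family, (p, n) in counts.items()}
total_samples = sum(p + n for p, n in counts.values())

for family, frac in fractions.items():
    print(f"{family}: {frac:.3f} positive")
print("total samples:", total_samples)
```

Each family has a positive-to-negative ratio of roughly 2:3, which is the basis for treating the datasets as relatively balanced.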
Feature extraction and peptide encoding
In this study, we use 560 features (physicochemical properties) of the twenty amino acid residues. Among them, 544 features were obtained from the AAindex database and the remaining 16 features were collected from the published literature. All features are normalized to the range from 0 to 1.
A fixed-length window is applied to scan a peptide sequence. The window size is optimized over the odd numbers from 3 to 21. The average of each feature over all amino acids in a window is assigned to the middle amino acid of the window. Thus the ith peptide is represented by N features in the form $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$, where N is 560.
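The window-averaging step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `encode_site`, the toy feature table, and the choice to truncate windows at sequence termini (the text does not say how termini are handled) are all assumptions.

```python
def encode_site(sequence, center, window, feature_table):
    """Average each feature over the residues in a window centred on `center`.

    sequence      : protein sequence as a string of one-letter residue codes
    center        : 0-based index of the candidate site
    window        : odd window length (3, 5, ..., 21 in the text)
    feature_table : dict mapping residue -> list of N normalised feature values
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    # Assumption: a window running past a terminus is truncated, not padded.
    start = max(0, center - half)
    stop = min(len(sequence), center + half + 1)
    residues = sequence[start:stop]
    n_features = len(next(iter(feature_table.values())))
    totals = [0.0] * n_features
    for residue in residues:
        for j, value in enumerate(feature_table[residue]):
            totals[j] += value
    return [t / len(residues) for t in totals]


# Toy table with N = 2 made-up features (the study uses N = 560).
toy_table = {"A": [0.0, 1.0], "S": [1.0, 0.0], "K": [0.5, 0.5]}
vector = encode_site("AKSKA", center=2, window=3, feature_table=toy_table)
edge_vector = encode_site("AKSKA", center=0, window=3, feature_table=toy_table)
```

Here `vector` averages the residues K, S, K around the central serine, while `edge_vector` shows the truncation behaviour assumed at the N-terminus.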
SVMs, RF and LS-SVMs
Support vector machines (SVMs) construct a separating hyperplane that maximizes the margin between classes. Optimizing an SVM classifier involves selecting the kernel and tuning the kernel's parameters and the soft-margin parameter C.
Random Forest (RF) is an ensemble machine learning method that uses many independent decision trees to perform classification or regression. Each member tree is built on a bootstrap sample of the training data, using random subsets of the available variables.
LS-SVMs can be considered a variant of classical SVMs. LS-SVMs carry out the optimization by solving a set of linear equations instead of the convex quadratic programming problem required by SVMs. LS-SVMs train faster than SVMs without sacrificing generalization performance. The LS-SVMs classifier is obtained by solving a constrained optimization problem as below (Formula 1).
where $\mathbf{x}_i$ is the ith sample, $y_i$ is its corresponding label, N is the number of samples, $e_i$ is the error, $\mathbf{w}$ is the vector of weights, ϕ(·) is the non-linear mapping function, and γ and b are parameters to be fitted.
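For reference, the standard LS-SVM classifier formulation that matches the symbols defined above can be written as follows. This is the textbook form, assumed here to correspond to Formula 1; the Lagrange multipliers $\boldsymbol{\alpha}$ and the matrix $\Omega$ in the second expression are notation introduced for this sketch.

```latex
% Primal problem (standard LS-SVM classifier)
\min_{\mathbf{w},\,b,\,\mathbf{e}} \; J(\mathbf{w},\mathbf{e})
  = \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
  + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^{2}
\quad \text{s.t.} \quad
y_i\left[\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right] = 1 - e_i,
\qquad i = 1,\dots,N

% Linear system obtained from the optimality conditions
\begin{bmatrix} 0 & \mathbf{y}^{\top} \\ \mathbf{y} & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix},
\qquad
\Omega_{ij} = y_i\, y_j\, \phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)
```

Because the constraints are equalities and the error term is squared, the optimality conditions reduce to the linear system shown, which is why training avoids quadratic programming.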
MTLS-SVMs is developed based on the mechanism of data amplification. An MTLS-SVMs classifier learns common parameters by integrating the sub-datasets. It is obtained by solving a constrained optimization problem as below (Formula 2), which can likewise be solved through a set of linear equations.
where T is the number of tasks, $N_t$ is the number of samples in the tth task, $\mathbf{w}_0$ is the vector of common weights shared by the T single tasks, $\mathbf{v}_t$ is the vector of weights for the tth task, $\mathbf{x}_{ti}$ is the ith sample of the tth task, $y_{ti}$ is its corresponding label, ϕ(·) is the non-linear mapping function, and λ, γ and b are parameters to be fitted.
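A plausible form of the multitask problem, written to be term-by-term parallel to the single-task LS-SVM objective, is sketched below. The symbol names $\mathbf{w}_0$ and $\mathbf{v}_t$, the per-task bias $b_t$, and the $\lambda/(2T)$ scaling of the task-specific term are assumptions made for this sketch and may differ from Formula 2 in the original article.

```latex
\min_{\mathbf{w}_0,\,\mathbf{v}_t,\,b_t,\,\mathbf{e}} \;
  \frac{1}{2}\,\mathbf{w}_0^{\top}\mathbf{w}_0
  + \frac{\lambda}{2T}\sum_{t=1}^{T}\mathbf{v}_t^{\top}\mathbf{v}_t
  + \frac{\gamma}{2}\sum_{t=1}^{T}\sum_{i=1}^{N_t} e_{ti}^{2}
\quad \text{s.t.} \quad
y_{ti}\left[(\mathbf{w}_0 + \mathbf{v}_t)^{\top}\phi(\mathbf{x}_{ti}) + b_t\right] = 1 - e_{ti},
\qquad t = 1,\dots,T,\; i = 1,\dots,N_t
```

Under this reading, λ controls how far each task may deviate from the shared weights: a large λ pushes every task-specific vector toward zero so that all tasks share one model, while a small λ lets the tasks be learned almost independently.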
The MT-Feats (Multi-Task Feature Learning and Selection) algorithm was derived from an MTL framework designed to learn a sparse representation shared across STs from the training data. MT-Feats originally comprises two algorithms for regression problems: the first was developed for feature learning and the second for feature selection.
We modify the MT-Feats algorithms to solve classification problems by using LS-SVMs as the base classifiers. MT-Feat1 was developed for feature learning and MT-Feat3 for feature selection. Both feature learning and feature selection learn common parameters by jointly regularizing a common term (Formula 3).
where W = UA and the other symbols have the same meaning as in Formula 2. If U is set to the identity matrix, the feature learning problem (MT-Feat1) reduces to a feature selection problem (MT-Feat3). Thus, MT-Feat3 is a special case of the MT-Feat1 algorithm (see Formula 4). In this study, we only use MT-Feat3 for feature selection.
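The relationship between the two problems can be illustrated with the general multi-task feature learning objective of Argyriou et al., on which MT-Feats is based. The generic loss $L$ below is a placeholder: the exact classification loss used in this study is not reproduced here, so this is a sketch of the regularization structure only.

```latex
% Feature learning (U orthogonal, W = UA)
\min_{U,\,A} \; \sum_{t=1}^{T}\sum_{i=1}^{N_t}
  L\!\left(y_{ti},\, \langle \mathbf{a}_t,\, U^{\top}\mathbf{x}_{ti} \rangle\right)
  + \gamma\,\lVert A \rVert_{2,1}^{2},
\qquad U^{\top}U = I

% (2,1)-norm: sum of the 2-norms of the rows of A
\lVert A \rVert_{2,1} = \sum_{j} \lVert \mathbf{a}^{j} \rVert_{2}

% Feature selection: the special case U = I, so W = A
\min_{W} \; \sum_{t=1}^{T}\sum_{i=1}^{N_t}
  L\!\left(y_{ti},\, \langle \mathbf{w}_t,\, \mathbf{x}_{ti} \rangle\right)
  + \gamma\,\lVert W \rVert_{2,1}^{2}
```

Here $\mathbf{a}_t$ and $\mathbf{w}_t$ are the columns for task t, and $\mathbf{a}^{j}$ is the jth row. Penalizing the sum of row norms drives entire rows of W to zero, so a feature is either kept for all tasks or discarded for all tasks.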
Performance is measured by average accuracy (aveAc), which is defined in Formula 5:

$$\mathrm{aveAc} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP and TN denote the total numbers of correctly classified positive and negative samples across all the STs, and FP and FN denote the total numbers of samples incorrectly classified as positive and as negative, respectively, across all the STs. Since the datasets are relatively balanced, the average accuracy is sufficient to measure the performance of the various predictors.
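A small Python sketch of this pooled measure is given below. The function name and the confusion counts are made up for illustration; only the pooling of TP, TN, FP and FN across tasks before dividing follows the definition above.

```python
def average_accuracy(per_task_counts):
    """Pooled accuracy over tasks: (TP + TN) / (TP + TN + FP + FN).

    per_task_counts: iterable of (tp, tn, fp, fn) tuples, one per single task.
    Counts are summed across all tasks before the ratio is taken.
    """
    counts = list(per_task_counts)
    tp = sum(c[0] for c in counts)
    tn = sum(c[1] for c in counts)
    fp = sum(c[2] for c in counts)
    fn = sum(c[3] for c in counts)
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up confusion counts for two hypothetical tasks (not results from the study).
toy_counts = [(8, 12, 3, 2), (5, 10, 0, 5)]
toy_ave_ac = average_accuracy(toy_counts)
print(round(toy_ave_ac, 4))
```

Because the counts are pooled before dividing, larger tasks contribute more to aveAc than smaller ones, which differs from averaging the per-task accuracies.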
Classification of family-specific phosphorylation sites by two MTL methods
We use the MTLS-SVMs and MT-Feat3 methods to build classifiers for predicting phosphorylation sites on the 4 kinase family datasets. To compare the performance of MTLS-SVMs and MT-Feat3 with that of the STL method, LS-SVMs classifiers are also built using the same datasets. Five-fold cross validation and grid-fitting of parameters are used to estimate the performance of all classifiers with window sizes from 3 to 21 (Table 1). Table 1 shows that the three methods generally agree in how the average classification accuracy (aveAc) varies with window size, and that a window size of 7 delivers the best performance for both STL and MTL classifiers. However, the performance of the MTL methods (MTLS-SVMs and MT-Feat3) is inferior to that of the STL method (LS-SVMs). We hypothesize two reasons. First, a uniform window size may not be a good choice for all four kinase family datasets because of the specificity of each kinase. Second, there are many redundant or irrelevant features that may decrease the performance. Therefore, in the following work we attempt to improve the performance of the MTL classifiers by optimizing window sizes and performing feature selection.
Optimized window sizes for 4 kinase families
For local window based methods, a proper window size reflects the physical or chemical effects of the local surroundings on the central amino acid. Different window sizes have been used in previous studies. For example, GPS, KinasePhos, and PPSP used a symmetrical window of 7 consecutive amino acid residues (7-mer), and NetPhosK used 15-mer and 17-mer windows. Instead of assuming a uniform window size for all kinase families, we build classifiers based on the Support Vector Machines (SVMs) and Random Forest (RF) algorithms to optimize the window size for each kinase family dataset. We use ten-fold cross validation and grid fitting of parameters to estimate the performance of all classifiers with 560 features (Table 2). The results clearly show that the performance of SVMs and RF follows a very similar trend across window sizes and that the optimized window sizes are insensitive to the classification algorithm. Generally, SVM models using the linear kernel deliver better performance than SVM models with the rbf kernel and RF models. Using the optimized window sizes presented in Table 2 (3, 17, 7 and 9 for the CDK, CK2, PKA and PKC datasets, respectively), we build the corresponding models and compare the results with the models using uniform window sizes (Table 1). It is clear that the optimized window sizes significantly improve the performance of LS-SVMs (aveAc = 0.7939), MTLS-SVMs (aveAc = 0.7936), and MT-Feat3 (aveAc = 0.791). In the following parts, window sizes of 3, 17, 7 and 9 for the CDK, CK2, PKA and PKC datasets, respectively, are referred to as the optimized window sizes.
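The per-family search over window sizes can be sketched as a simple loop. The callback `cv_score(family, window)` is hypothetical and stands in for training a classifier and returning its cross-validated accuracy; the toy scorer and the family names "X" and "Y" are synthetic placeholders, not results from the study.

```python
def select_window_sizes(families, cv_score, windows=range(3, 22, 2)):
    """Return {family: best window} by maximising a cross-validated score.

    families : iterable of dataset names
    cv_score : hypothetical callback (family, window) -> score to maximise
    windows  : candidate window sizes, by default the odd numbers 3..21
    """
    candidates = list(windows)
    best = {}
    for family in families:
        # max() keeps the first (smallest) window when scores are tied.
        best[family] = max(candidates, key=lambda w: cv_score(family, w))
    return best


# Synthetic scorer for demonstration only; it does not reproduce any accuracy
# reported in the article.
_toy_peak = {"X": 5, "Y": 13}


def toy_score(family, window):
    return -abs(window - _toy_peak[family])


chosen = select_window_sizes(["X", "Y"], toy_score)
```

Selecting the window per family rather than once for all families is what distinguishes the optimized setting from the uniform 7-mer setting.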
Feature selection and validation
Feature selection can improve classifiers not only by making them faster and more effective but also by providing a better understanding of the relevant biological processes. In addition to performing classification, MT-Feat3 is capable of selecting common features across multiple tasks. We first construct a weight matrix W of dimension 560 × 4 to represent the significance of the 560 features across the 4 kinase family datasets, using a uniform window size of 7. MT-Feat3 can significantly reduce the dimension of the features by eliminating rows with zero weights. We then compute the 2-norm of each non-zero row of W to obtain the significance $\lVert w_i \rVert_2$, which represents the importance of the ith feature across the 4 kinase family datasets. All features with $\lVert w_i \rVert_2$ larger than zero are considered significant common features and are sorted by importance. In addition, the same feature selection procedure is conducted using the optimized window sizes for the 4 kinase family datasets (Table 2).
Using various numbers of the most important features, ranked by the models using either the uniform window or the optimized windows, we develop two series of MT-Feat3 models. In addition, we develop the corresponding MTLS-SVM classifiers using the same sets of features. The average accuracies of all models are displayed in Figure 1. Based on Figure 1, we select 20 features for the models using the window size of 7 and 26 features for the models using the optimized window sizes. The MTLS-SVM models using these sets of features achieve average accuracies of 0.7621 and 0.7962, higher than those of MTLS-SVMs before feature selection (0.7595 and 0.7936, respectively; Table 3). Thus feature selection by MT-Feat3 can improve the performance, and the performance of MT-Feat3 and MTLS-SVMs is quite consistent. In addition, using the optimized windows results in better performance than using a uniform window size of 7 (Table 3). The MTLS-SVM model using the 26 selected features with the optimized window sizes achieves performance comparable to that of MetaPred (0.7962 vs 0.7997).
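The ranking of features by the 2-norm of their rows in W can be sketched as follows. The function name, the tolerance argument and the toy matrix are assumptions for illustration; in the study W has 560 rows (features) and 4 columns (kinase families).

```python
import math


def rank_common_features(W, names, tol=0.0):
    """Rank features by the 2-norm of their row in the weight matrix W.

    W     : list of rows, one per feature; each row holds one weight per task
    names : feature identifiers, in the same order as the rows of W
    tol   : rows with norm <= tol are treated as eliminated (assumed cutoff)
    Returns (name, norm) pairs sorted by decreasing norm.
    """
    scores = [math.sqrt(sum(w * w for w in row)) for row in W]
    kept = [(name, score) for name, score in zip(names, scores) if score > tol]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 3 x 4 weight matrix with made-up values.
toy_W = [
    [0.0, 0.0, 0.0, 0.0],   # eliminated: zero weight in every task
    [0.3, -0.4, 0.0, 0.0],  # row norm 0.5
    [1.0, 1.0, 1.0, 1.0],   # row norm 2.0
]
toy_names = ["f1", "f2", "f3"]
ranking = rank_common_features(toy_W, toy_names)
```

A feature survives only if it has a non-zero weight in at least one task, and the row norm summarizes its importance over all tasks at once.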
Analysis of selected features
The selected feature subsets, subset 1 (20 features, uniform window size of 7) and subset 2 (26 features, optimized window sizes), are listed in Table 4. Fourteen features appear in both subset 1 and subset 2. These 14 common features can be grouped into 6 categories: backbone electrostatic interactions ("AVBF000101", "AVBF000102", "AVBF000104", "AVBF000105", "AVBF000106", "AVBF000107", "AVBF000108", "AVBF000109"), hydrophobicity ("ROSM880104", "ROSM880105"), apparent partition energies ("GUYH850103"), negative charge ("FAUJ880112"), fractional occurrence in left helix regions ("RACS820103") and side chain conformation ("YANJ020101").
To investigate the relationships among the selected features, we cluster the features in subset 1 (Figure 2A) and subset 2 (Figure 2B) by Pearson correlation coefficient distances and construct a two-dimensional map (Figure 2A) by the metric multi-dimensional scaling method. All features that are highly correlated with other features (labelled by # in Table 4) are removed from subsets 1 and 2, resulting in subset 3 (12 features) and subset 4 (18 features), respectively. The detailed description of subsets 1, 2, 3 and 4 is available in Additional file 1.
The best aveAc of MTLS-SVMs with subset 4 is 0.792, very close to that of MTLS-SVMs with all features (0.7936). The best aveAc of MTLS-SVMs with subset 3 is 0.7455, which is slightly poorer than that of MTLS-SVMs with all features (0.7595) (Table 4). Therefore, the 18 features in subset 4 are considered significant properties related to protein phosphorylation.
In this study, we use a multi-task learning framework to investigate phosphorylation sites across 4 kinase family datasets. In this framework, MT-Feat3 is used to select common features, which are then validated by MTLS-SVMs classifiers. The selected features are further reduced to 18 features after eliminating features that are highly correlated with other features. These features are considered important common features for further analysis of the possible properties and mechanisms of protein phosphorylation.
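One simple way to remove highly correlated features is a greedy filter over Pearson correlation coefficients, sketched below. This is a swapped-in illustration, not the procedure used in the study: the greedy keep-first rule, the 0.9 threshold and the toy values are all assumptions, and the text does not state the cutoff that was actually applied.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(features, threshold=0.9):
    """Greedy redundancy filter (assumed rule, hypothetical threshold).

    features: dict name -> list of values, ordered by decreasing importance.
    A feature is kept only if |r| with every already-kept feature is below
    the threshold, so the more important member of a correlated pair survives.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up feature values; "b" is a scaled copy of "a" and is therefore dropped.
toy_features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
}
non_redundant = drop_correlated(toy_features)
```

The same correlation values, converted to distances such as 1 − |r|, could serve as input to clustering or multi-dimensional scaling.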
Stock AM, Robinson VL, Goudreau PN: Two-component signal transduction. Annu Rev Biochem 2000, 69: 183–215. 10.1146/annurev.biochem.69.1.183
Thomason P, Kay R: Eukaryotic signal transduction via histidine-aspartate phosphorelay. J Cell Sci 2000,113(18):3141–3150.
Wan J, Kang SL, Tang CN, Yan JH, Ren YL, Liu J, Gao XL, Banerjee A, Ellis LBM, Li TB: Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic acids research 2008,36(4):e22.
Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science 2002,298(5600):1912–1934. 10.1126/science.1075762
Xue Y, Gao XJ, Cao J, Liu ZX, Jin CJ, Wen LP, Yao XB, Ren JA: A Summary of Computational Resources for Protein Phosphorylation. Curr Protein Pept Sci 2010,11(6):485–496. 10.2174/138920310791824138
Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic acids research 2003,31(13):3635–3641. 10.1093/nar/gkg584
Kim JH, Lee J, Oh B, Kimm K, Koh IS: Prediction of phosphorylation sites using SVMs. Bioinformatics 2004,20(17):3179–3184. 10.1093/bioinformatics/bth382
Koenig M, Grabe N: Highly specific prediction of phosphorylation sites in proteins. Bioinformatics 2004,20(18):3620–3627. 10.1093/bioinformatics/bth455
Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004,4(6):1633–1649. 10.1002/pmic.200300771
Zhou FF, Xue Y, Chen GL, Yao XB: GPS: a novel group-based phosphorylation predicting and scoring method. Biochemical and Biophysical Research Communications 2004,325(4):1443–1448. 10.1016/j.bbrc.2004.11.001
Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic acids research 2005, 33: W226-W229. 10.1093/nar/gki471
Xue Y, Li A, Wang LR, Feng HQ, Yao XB: PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 2006, 7: 163. 10.1186/1471-2105-7-163
Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. Journal of molecular biology 1999,294(5):1351–1362. 10.1006/jmbi.1999.3310
Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK: The importance of intrinsic disorder for protein phosphorylation. Nucleic acids research 2004,32(3):1037–1049. 10.1093/nar/gkh253
Caruana R: Multitask Learning: A Knowledge-Based Source of Inductive Bias. Proceedings of the 10th International Conference on Machine Learning 1993, 41–48.
Argyriou A, Evgeniou T, Pontil M: Multi-Task Feature Learning. NIPS 2006.
Argyriou A, Micchelli CA, Pontil M, Ying Y: A Spectral Regularization Framework for Multi-Task Structure Learning. NIPS 2007.
Zhang K, Gray JW, Parvin B: Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics 2010,26(12):i97-i105. 10.1093/bioinformatics/btq181
Xu Q, Pan SJ, Xue HH, Yang Q: Multitask Learning for Protein Subcellular Location Prediction. IEEE/ACM Trans Comput Biol Bioinform 2011, 8: 748–759.
Liu Q, Xu Q, Zheng VW, Xue H, Cao ZW, Yang Q: Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study. BMC Bioinformatics 2010, 11: 181. 10.1186/1471-2105-11-181
Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic acids research 2008, 36: D202–205. 10.1093/nar/gkn255
van Gestel T, Suykens JAK, Baesens B, Viaene S, Vanthienen J, Dedene G, de Moor B, Vandewalle J: Benchmarking least squares support vector machine classifiers. Mach Learn 2004,54(1):5–32.
Argyriou A, Evgeniou T, Pontil M: Convex multi-task feature learning. Mach Learn 2008,73(3):243–272. 10.1007/s10994-007-5040-8
Mead A: Review of the Development of Multidimensional-Scaling Methods. Statistician 1992,41(1):27–39. 10.2307/2348634
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
We thank Dr. Andreas Argyriou for his helpful discussion. This work was supported in part by the National Institutes of Health (NIH) Grant P01 AG12993 (PI: E. Michaelis).
The authors declare that they have no competing interests.
SG and SX developed the programs. SG and YF did the dataset construction and calculation, and drafted the manuscript. JF conceived of the project, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.
Cite this article
Gao, S., Xu, S., Fang, Y. et al. Using multitask classification methods to investigate the kinase-specific phosphorylation sites. Proteome Sci 10, S7 (2012). https://doi.org/10.1186/1477-5956-10-S1-S7
- Support Vector Machine
- Feature Selection
- Window Size
- Random Forest
- Phosphorylation Site