Fast subcellular localization by cascaded fusion of signal-based and homology-based methods
Proteome Science volume 9, Article number: S8 (2011)
Abstract
Background
The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means.
Results
This paper proposes mitigating the computational burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localization is subsequently predicted by a profile-to-profile alignment support vector machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA).
Conclusions
Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can exploit the best properties of signal-based and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without significant loss in subcellular localization accuracy. It was also found that PDA enjoys a shorter training time than the conventional SVM. We anticipate that the method will be useful for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations on new algorithms that involve pairwise alignments.
Background
Motivation of subcellular localization prediction
For a protein to function properly, it must be transported to the correct organelle of a cell and folded into the correct 3D structure. Therefore, knowing the subcellular localization of a protein is one step towards understanding its functions. However, determining subcellular localization by experimental means is often time-consuming and laborious. Given the large number of unannotated sequences from genome projects, it is imperative to develop efficient and reliable computational techniques for annotating biological sequences.
In recent years, impressive progress has been made in the computational prediction of subcellular localization, and a number of approaches have been proposed in the literature. These methods can be broadly divided into four categories: predictions based on sorting signals [1–6], global sequence properties [7–10], homology [11–13], and other information in addition to sequences [14, 15]. Methods based on sorting signals are very fast, but they typically suffer from low prediction accuracy. Homology-based methods are more accurate, but they are very slow. Therefore, fast and reliable prediction of subcellular localization remains a challenge.
Approaches to subcellular localization prediction
Signal-based methods predict the localization via the recognition of N-terminal sorting signals in amino acid sequences. PSORT, proposed by Nakai in 1991 [2], is one of the earliest predictors that use sorting signals for a protein's subcellular localization. PSORT and its extension, WoLF PSORT [3, 4], derive features such as amino acid composition and the presence of sequence motifs for localization prediction. In the late 1990s, researchers started to investigate the application of neural networks [16] to recognizing the sorting signals. In a neural network, patterns are presented to the input layer of artificial neurons, with each neuron implementing a nonlinear function of the weighted sum of its inputs. Because amino acid sequences are of variable length, the input to the neural network is extracted from a short window sliding over the amino acid sequence. TargetP [17, 18] is a well-known predictor that uses neural networks.
Another type of approach relies on the fact that proteins of different organelles have different global properties such as amino-acid composition. Based on amino-acid composition and residue-pair frequencies, Nakashima and Nishikawa [10] developed a predictor that can discriminate between soluble intracellular and extracellular proteins. Another popular predictor based on amino acid composition is SubLoc [7]. In SubLoc, a query sequence is converted to a 20-dim amino-acid composition vector for classification by support vector machines (SVMs). Recently, Xu et al. [19] proposed a semi-supervised learning technique (a kind of transductive learning) that makes use of unlabelled test data to boost the classification performance of SVMs. One limitation of composition-based methods is that information about the sequence order is not easy to represent. Some authors proposed using amino-acid pair compositions (dipeptides) [8, 9, 20] and pseudo amino-acid compositions [21] to enrich the representational power of the extracted vectors.
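The 20-dim amino-acid composition vector used by predictors such as SubLoc is simply the fraction of each residue type in the sequence. A minimal sketch (not SubLoc's actual code):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Return the 20-dim amino-acid composition vector of a sequence,
    i.e. the fraction of each residue type, ordered as in AMINO_ACIDS."""
    counts = Counter(seq.upper())
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

vec = aa_composition("MKTAYIAKQR")  # toy 10-residue sequence
```

Such a vector would then be fed to an SVM; note that all sequence-order information is lost, which motivates the dipeptide and pseudo amino-acid extensions cited above.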
The homology-based methods use the query sequence to search protein databases for homologs [11, 12] and assign the query sequence the subcellular location to which the homologs belong. This kind of method can achieve very high accuracy when homologs of experimentally verified sequences can be found in the database search [22]. A number of homology-based predictors have been proposed. For example, Proteome Analyst [23] uses the presence or absence of tokens from certain fields of the homologous sequences in the Swiss-Prot database as a means to compute features for classification. In Kim et al. [24], an unknown protein sequence is aligned with every training sequence (with known subcellular location) to create a feature vector for classification. Mak et al. [13] proposed a predictor called PairProSVM that uses profile alignment to detect weak similarity between protein sequences. Given a query sequence, a profile is obtained from a PSI-BLAST search [25]. The profile is then aligned with every training profile to form a score vector for classification by SVMs.
Some predictors not only use amino acid sequences as input but also require extra information such as lexical context in database entries [14] or Gene Ontology entries [15] as input. Although studies have shown that this type of method can outperform sequencebased methods, the performance has only been measured on data sets where all sequences have the required additional information.
Limitations of existing approaches
Among the methods mentioned above, the signal-based and homology-based methods have attracted a great deal of attention, primarily because of their biological plausibility and robustness in predicting newly discovered sequences. Comparing these two approaches, the signal-based methods seem to be more direct, because they determine the localization from the sequence segments that contain the localization information. However, this type of method is typically limited to the prediction of a few subcellular locations only. For example, the popular TargetP [5, 6] can only detect three localizations: chloroplast, mitochondria, and the secretory pathway (signal peptide). The homology-based methods, on the other hand, can in theory predict as many localizations as are available in the training data. The downside, however, is that the whole sequence is used for the homology search or pairwise alignment, without considering the fact that some segments of the sequence are more important or contain more information than others. Moreover, the computational requirement will be excessive for long sequences, and the problem becomes intractable for database annotation, where tens of thousands of proteins are involved.
Our proposal for addressing the limitations
Our earlier report [26] demonstrated that the computation time of subcellular localization based on profile-alignment SVMs can be substantially reduced by aligning profiles only up to the cleavage site positions of signal peptides, mitochondrial targeting peptides, and chloroplast transit peptides. Although a 20-fold reduction in total computation time (including alignment, training, and recognition time) was achieved, the method fails to reduce the profile creation time, which becomes a substantial part of the total computation time when the database grows large. In this paper, we propose a new approach that can reduce both the profile creation time and the profile alignment time. In the new approach, instead of cutting the profiles, we shorten the sequences by cutting them at the cleavage site locations. The shortened sequences are then presented to PSI-BLAST to compute the profiles. To further reduce the training and recognition time of the classifier, we propose replacing the SVMs by kernel perturbational discriminants.
Fusion of signal-based and homology-based methods
Fig. 1 shows the histograms of the lengths of signal peptides (SP), mitochondrial transit peptides (mTP), and chloroplast transit peptides (cTP). The length is the number of amino acids from the N-terminus up to the cleavage site. Evidently, these peptides are rather short. Given that the majority of proteins in the Swiss-Prot database have a few hundred amino acids and that some proteins can be longer than 5,000 amino acids, tremendous computational savings can be achieved by combining the signal-based and homology-based methods as described below.
Truncation of profiles/sequences
We have investigated two fusion schemes (see Fig. 2):
I: Truncating Profiles. Given a query sequence, we pass it to PSI-BLAST [25] to determine a full-length profile (PSSM and PSFM [13]). The profile is then truncated at the cleavage site position. The truncated profile is aligned with each of the training profiles to create a vector for classification. Note that the training profiles are also created by the same procedure.
II: Truncating Sequences. Given a query sequence, we truncate it at the cleavage site and pass the truncated sequence to PSI-BLAST to determine a short profile. The profile is then aligned with all of the training profiles to create a vector for classification. All training profiles are also created by the same procedure.
Note that because the time taken by a PSI-BLAST search (profile-creation time) is proportional to the length of the query sequence, Scheme II is expected to provide greater computational savings than Scheme I. However, as the sequences are truncated at an early stage, important information may be lost if cleavage site prediction is inaccurate. The “Results and Discussion” Section provides experimental evidence suggesting that Scheme II can provide significant computational savings without severe information loss.
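The two schemes differ only in where the truncation happens relative to profile creation. A minimal sketch, in which `compute_profile` is a hypothetical stand-in for a PSI-BLAST call:

```python
def scheme_I(seq, cleavage_pos, compute_profile):
    """Scheme I: compute a full-length profile, then truncate the
    profile at the cleavage-site position (profile creation is slow
    because the full sequence is searched)."""
    profile = compute_profile(seq)
    return profile[:cleavage_pos]

def scheme_II(seq, cleavage_pos, compute_profile):
    """Scheme II: truncate the sequence first, then compute a short
    profile (PSI-BLAST time grows with query length, so this is faster)."""
    return compute_profile(seq[:cleavage_pos])
```

With a real position-specific profile the two results need not be identical, since Scheme I's profile columns are conditioned on the full sequence; the paper's point is that the resulting score matrices are nonetheless very similar.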
Cleavage site prediction
This work investigated two cleavage site predictors: conditional random fields (CRFs) [27, 28] and TargetP [5, 6]. CRFs [29] were originally designed for sequence labelling tasks such as part-of-speech (POS) tagging. Given a sequence of observations, a CRF finds the most likely label for each observation. To use CRFs for cleavage site prediction, amino acid sequences are treated as observations and each amino acid in a sequence is labelled as Signal, Cleavage, or Mature, e.g., SSSSSSCMMMMMM, as illustrated in Fig. 3. The cleavage site is located at the transition between C and M. Amino acids of similar properties can be categorized according to their hydrophobicity and charge/polarity, as shown in Table 1. These properties are used because the h-region of signal peptides is rich in hydrophobic residues and the c-region is dominated by small, non-polar residues [30]. Moreover, as illustrated in Fig. 4, the degree of hydrophobicity differs markedly across positions, making this feature useful for the labelling task.
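Once the CRF has emitted a per-residue label string, recovering the cleavage site is just a matter of locating the C-to-M transition. A small sketch of this post-processing step (not the CRF itself):

```python
def cleavage_site(labels):
    """Given a per-residue label string (S=Signal, C=Cleavage, M=Mature),
    return the 1-based position of the residue labelled C immediately
    before the C->M transition, or None if no such transition exists."""
    idx = labels.find("CM")
    if idx == -1:
        return None
    return idx + 1  # convert 0-based index of 'C' to 1-based position

assert cleavage_site("SSSSSSCMMMMMM") == 7
```

The mature chain then begins at position `cleavage_site(labels) + 1`.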
TargetP is one of the most popular signal-based subcellular localization and cleavage site predictors. Given a query sequence, TargetP determines its subcellular localization and also invokes SignalP [31], ChloroP [32], or a program specialized for mTP to determine the cleavage site of the sequence. TargetP requires the N-terminal sequence of a protein as input. During prediction, a sliding window scans over a query sequence; for each segment within the window, a numerically encoded vector is presented to a neural network to compute the segment score. The cleavage site is determined by finding the position at which the score is maximum. The cleavage site prediction accuracy of SignalP on eukaryotic proteins is around 70% [33] and that of ChloroP on cTP is 60% (±2 residues) [32].
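The sliding-window-and-argmax scheme can be sketched generically; here `score_fn` is a hypothetical stand-in for the trained neural network, and the reported position is simply the centre of the best-scoring window:

```python
def predict_cleavage(seq, window, score_fn):
    """Slide a fixed-size window over the sequence, score each segment
    with score_fn, and return the (0-based) centre position of the
    window whose score is maximum."""
    best_pos, best_score = None, float("-inf")
    for i in range(len(seq) - window + 1):
        s = score_fn(seq[i:i + window])
        if s > best_score:
            best_pos, best_score = i + window // 2, s
    return best_pos
```

TargetP's actual encoding and network architecture are described in [17, 18]; this sketch only illustrates the argmax-over-windows decision rule.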
Methods
Data preparation
Protein sequences with experimentally annotated subcellular locations were extracted from Swiss-Prot Release 57.5 according to the following criteria.
1. Only entries of eukaryotic species, annotated with “Eukaryota” in the OC (Organism Classification) field of Swiss-Prot, were included.

2. Entries annotated with ambiguous words, such as “probable”, “by similarity”, and “potential”, were excluded because of the lack of experimental evidence.

3. Sequences annotated with “fragment” were excluded.

4. For signal peptides, mitochondria, and chloroplasts, only sequences with experimentally annotated cleavage sites were included.
The extracted sequences were then filtered by BLASTClust [34] so that the resulting sequences have sequence identity less than 25%. Table 2 shows the breakdown of the dataset. A modified version of the Perl scripts provided by [35] was used for creating the dataset.
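The inclusion criteria above amount to a filter predicate over Swiss-Prot entries. A minimal sketch (the field names and helper below are hypothetical illustrations, not the actual Perl scripts of [35]):

```python
AMBIGUOUS = ("probable", "by similarity", "potential")

def keep_entry(oc_field, location_note, is_fragment):
    """Apply the paper's inclusion criteria to one Swiss-Prot entry:
    eukaryotic, experimentally supported annotation, not a fragment."""
    if "Eukaryota" not in oc_field:            # criterion 1
        return False
    note = location_note.lower()
    if any(word in note for word in AMBIGUOUS):  # criterion 2
        return False
    if is_fragment:                            # criterion 3
        return False
    return True
```

Criterion 4 (experimentally annotated cleavage sites) and the 25% identity filtering by BLASTClust would be applied as separate passes over the surviving entries.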
PDA and SVM for multiclass classification
We used perturbational discriminant analysis (PDA) [36] and support vector machines (SVMs) [37] for classification. The formulation of PDA can be found in the Appendix. During the training phase, N training profiles were obtained by Scheme I or Scheme II. Pairwise profile alignments were then performed to create an N × N symmetric score matrix K, which was then used to train the PDA and SVM classifiers as follows.
One-vs-rest PDA and SVM classifiers
A C-class problem can be formulated as C binary classification problems, each solved by a binary classifier. Given the training sequences of C classes, we trained C PDA score functions:

g_i(x) = a_i^T k(x) + b_i,  i = 1, …, C,  (1)

where x is a query sequence, k(x) contains the similarities (via profile alignment) between x and the N training profiles, and a_i and b_i were obtained by Eq. 11 and Eq. 12 in the Appendix.
For the SVM classifier, the score functions in Eq. 1 are replaced by the linear SVM score functions:

g_i(x) = Σ_j α_ij y_ij k(x_j)^T k(x) + b_i,

where the α_ij's are the Lagrange multipliers of Class i, and y_ij = 1 if x_j belongs to Class i and y_ij = –1 otherwise. Then, given a test sequence x, the class label is given by

ŷ(x) = arg max_{i=1,…,C} g_i(x).
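The one-vs-rest decision rule amounts to evaluating all C linear score functions on the alignment-score vector k(x) and taking the argmax. A minimal sketch, assuming the per-class weight vectors (columns of A, each absorbing the α_ij y_ij terms) and biases b have already been trained:

```python
import numpy as np

def one_vs_rest_predict(k_x, A, b):
    """One-vs-rest decision: g_i(x) = a_i^T k(x) + b_i for each class i,
    then pick the class with the largest score. A is N x C (one weight
    column per class), b is a length-C bias vector, k_x is length N."""
    scores = A.T @ k_x + b
    return int(np.argmax(scores))
```

The same rule serves both the PDA and the SVM variants; only how A and b are obtained differs.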
Cascaded fusion of PDA and SVM
Instead of using Eqs. 11 and 12, the optimal weights in PDA can also be equivalently expressed in terms of d and η in Eqs. 8 and 9. In a C-class problem, the i-th class has its corresponding d_i and η_i, where i = 1, …, C. However, because of the dependence among the d_i's, the rank of the matrix [d_1, …, d_C] is C – 1. Therefore, there are C – 1 independent sets of PDA parameters:
where 1 is an N-dim vector of all 1's and ρ is a perturbation parameter. During recognition, an unknown sample x is projected onto a (C – 1)-dim PDA space spanned by [a_1, …, a_{C–1}] using
g(x) = Â^T k(x) + [b_1, …, b_{C–1}]^T,  g(x) ∈ ℝ^{C–1}.
Then, g(x) is classified by one-vs-rest RBF-SVMs. In the sequel, we refer to this cascaded fusion as PDA-proj+SVM. Fig. 5 exemplifies the capability of PDA-proj+SVM using a 2-dim multi-class problem.
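The projection step of the cascade is a single affine map of the alignment-score vector. A minimal sketch, assuming the trained PDA weights Â (columns a_1, …, a_{C–1}) and bias vector are given; the subsequent RBF-SVM stage is omitted:

```python
import numpy as np

def pda_project(k_x, A_hat, b):
    """Map an N-dim profile-alignment score vector k(x) into the
    (C-1)-dim PDA space: g(x) = A_hat^T k(x) + b. A_hat is N x (C-1),
    b is a length-(C-1) bias vector."""
    return A_hat.T @ k_x + b
```

The low-dimensional g(x) vectors, rather than the raw N-dim score vectors, are what the one-vs-rest RBF-SVMs of PDA-proj+SVM are trained on; this is the source of the reduced training time reported later.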
Performance evaluation
We used 5-fold cross-validation to evaluate the performance. The overall prediction accuracy, the accuracy for each subcellular location, and the Matthews correlation coefficient (MCC) [38] were used to quantify the prediction performance. MCC allows us to overcome the shortcoming of accuracy on unbalanced data [38].
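For reference, the per-class MCC can be computed from one-vs-rest confusion counts with the standard formula (this is a textbook definition, not code from the paper):

```python
def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from one-vs-rest confusion
    counts. Returns 0.0 by convention when any marginal is empty."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0
```

Unlike accuracy, MCC stays at 0 for a trivial classifier that predicts the majority class on unbalanced data, which is why it is reported alongside accuracy here.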
We measured the computation time on a Core 2 Duo 3.16GHz CPU running Matlab and SVMlight. The computation time was divided into profile creation time, alignment time, classifier training time, and classification time.
Results and discussion
Performance of cleavage site prediction
Table 3 shows the cleavage site prediction accuracy of TargetP and CSitePred [28] (a CRFbased predictor). It suggests that CSitePred is better than TargetP(P) in terms of predicting the cleavage sites of signal peptide (SP) but is poorer than TargetP(N). The results also suggest that while CSitePred is slightly inferior to TargetP in predicting the cleavage sites of mitochondria, it is significantly better than TargetP in predicting the cleavage sites of chloroplasts. Note that the overall accuracies depend heavily on the SP class because of the large number of signal peptides in the dataset (see Table 2).
The prediction accuracy on chloroplasts by TargetP shown in Table 3 is significantly lower than that in [32]. There are two reasons for this difference: (1) our dataset has lower sequence identity than that of [32]; and (2) we consider only exact predictions of the ground-truth sites as correct, whereas [32] considers predictions within ±2 positions of the ground-truth sites as correct. In fact, if we relax the criterion of correct prediction to ±2 positions of the ground truth, the prediction accuracy on chloroplasts achieved by TargetP increases to 47.06%.
Sensitivity analysis
To evaluate the effect of incorrect cleavage site prediction on the accuracy of subcellular localization, sensitivity analysis was performed by truncating SP, mTP, and cTP at the ground-truth cleavage sites and at several positions upstream and downstream of them. Specifically, the sequence cutoff positions are 16, 8, and 2 amino acids upstream and 2, 16, 32, and 64 amino acids downstream of the ground-truth cleavage site.
Fig. 6 shows that the overall accuracy of subcellular localization does not depend significantly on the precision of cleavage site prediction, as long as the predicted sites are not too far from the ground truths.
Apparently, mTP and cTP are more sensitive to cleavage site prediction errors, which agrees with the fact that the signals of mTP and cTP are weaker. Localization performance on these sequences degrades when the cutoff position drifts significantly away from the ground-truth cleavage site. Nevertheless, the overall accuracy can be maintained above 95% even if the drift is as large as –16 or +64 positions from the ground truth. Moreover, a forward drift of 64 positions leads to a higher overall accuracy than a backward drift of 16 positions. This suggests that cutting sequences before their cleavage sites may lose useful information in the signal peptides, whereas including extra (possibly irrelevant) information by cutting sequences after their cleavage sites is not detrimental to subcellular localization accuracy.
Profile-creation time
Fig. 7 shows the score matrices obtained by the two profile creation schemes illustrated in Fig. 2. The two alignment score matrices exhibit a similar pattern, suggesting that classifiers based on these matrices will produce similar classification accuracy. This is confirmed by Table 4, which shows that cutting the sequences at the cleavage sites before inputting them to PSI-BLAST can reduce the profile creation time six-fold without significant reduction in subcellular localization accuracy.
Profile-alignment time
Table 5 shows that the computation time for full-length profile alignment is striking: nearly thirty-five seconds per sequence, which makes full-length alignment computationally prohibitive. Therefore, it is imperative to limit the length of the sequences or profiles before alignment. Table 5 also shows that truncating the sequences at their cleavage site positions leads to nearly a 20-fold reduction in alignment time without any loss in subcellular localization performance. This is because the signal segment is found at the N-terminus, and removing the amino acids beyond the cleavage site helps the alignment focus on the relevant features in the profiles and disregard noise.
SVM versus PDA
Table 6 shows that the training times of PDA and PDA-proj+SVM are only one-fifth of that of the SVM. However, the accuracies of PDA and PDA-proj+SVM are lower than that of the SVM.
Comparison with state-of-the-art predictors
We compared the accuracy of the proposed fusion of signal-based and homology-based methods with SubLoc [7], TargetP [5], and PairProSVM [13]. Table 7 shows that the overall accuracy of the proposed method (the 5th row) is 5.2% higher than that of TargetP (3rd row) and significantly better than that of SubLoc (1st row). Our method outperforms TargetP in Ext (SP) and Cyt/Nuc prediction while performing worse than TargetP in predicting Mit and Chl. One limitation of TargetP is that users need to select either “Plant” or “Non-plant”. If the former is selected, the performance on Ext and Cyt/Nuc degrades significantly, leading to a low overall accuracy; if the latter is selected, none of the chloroplast proteins can be correctly predicted. The cascaded fusion of cleavage site prediction and PairProSVM, on the other hand, can classify all four classes with fairly high accuracy, leading to a higher overall accuracy.
The prediction accuracies and MCCs of the proposed methods (Rows 4–10 in Table 7) are comparable to those of PairProSVM (Row 4 in Table 7). The main improvement is the reduction in computation time.
Because ChloroP is weak in predicting the cleavage sites of chloroplasts (see Table 3), it is not a good candidate for assisting PairProSVM. This is evident by the low subcellular localization accuracy of chloroplasts in Table 7 when TargetP is used as a cleavage site predictor. However, TargetP is fairly good at predicting the subcellular location of chloroplasts when it is used as a localization predictor.
Among the four classes in Table 7, the subcellular localization accuracies of mitochondria and chloroplasts are generally lower than those of Ext and Cyt/Nuc. The reason may be that these transit peptides are less well characterized and their motifs are less conserved than those of secretory signal peptides [6].
Table 7 also suggests that TargetP(N) is very effective in assisting PairProSVM, leading to the highest prediction accuracy (92.6%) among all subcellular localization predictors. In particular, except for predicting Chl, TargetP in combination with PairProSVM surpasses the other methods in subcellular localization accuracy and MCC.
Conclusions
This paper has demonstrated that homology-based subcellular localization can be sped up by reducing the length of the query amino acid sequences. Because shortening an amino acid sequence inevitably discards some information, it is imperative to determine the best truncation positions. This paper shows that these positions can be determined by cleavage site predictors such as TargetP and CSitePred. The paper also shows that, as far as localization accuracy is concerned, it does not matter whether we truncate the sequences or the profiles. However, truncating the sequences has a computational advantage, because this strategy can reduce the profile creation time by as much as six-fold.
Appendix: kernel discriminant analysis
This appendix derives the formulations of kernel discriminant analysis. The key idea lies in the equivalence between the optimal projection vectors in the Hilbert space, the spectral space, and the empirical space.
Input, Hilbert, spectral, and empirical spaces
Denote the mapping from an input space X into a Hilbert space H as:

φ: X → H, x ↦ φ(x).
In bioinformatics, X is a vectorial space for microarray data and a sequence space for DNA or protein sequences. Given a training dataset {x_1, …, x_N} in X and a kernel function K(x, y), an object x can be represented by a vector of similarities with respect to all of the training objects [39]:

k(x) = [K(x_1, x), …, K(x_N, x)]^T.
This N-dim space, denoted by K, is called the empirical space. The associated kernel matrix is defined as K = [K(x_i, x_j)]_{i,j=1}^N.
The construction of the empirical space for vectorial and non-vectorial data is quite different. For the former, the elements of K are a simple function of the corresponding pairs of vectors in X. For the latter, the elements of K are similarities between the corresponding pairs of objects.
The kernel matrix K can be factorized with respect to the basis functions in H: K = Φ^T Φ, where Φ = [φ(x_1), …, φ(x_N)]. Alternatively, it can be factorized via spectral decomposition: K = U^T Λ U = E^T E, where E = Λ^{1/2} U.
Denote the i-th row of E as e^(i) = [e^(i)(x_1), …, e^(i)(x_N)]. The rows of E exhibit a vital orthogonality property:

e^(i) (e^(j))^T = λ_i δ_ij,

where λ_i is the i-th element of the diagonal of Λ and δ_ij is the Kronecker delta.
For any positive-definite kernel function K(x, y) and training dataset {x_1, …, x_N} in X, there exists a (nonlinear) mapping from the original input space X to an N-dim spectral space E:
Note that K = E^T E, i.e., K(x_i, x_j) = e(x_i)^T e(x_j). Therefore, φ(x_i)^T φ(x_j) = e(x_i)^T e(x_j).
Many kernel-based machine learning problems involve finding optimal projection vectors in H, E, and K, which will be respectively denoted as w, v, and a. It can be shown [36] that the projection vectors are linearly related as follows:

w^T φ(x) = v^T e(x) = a^T k(x),  (2)

where we have used the relationships w = Φa and v = Ea.
Orthogonal hyperplane principle (OHP)
Assume that the dimension of H is M and that the training data in H are mass-centered. When M > N, all of the N training vectors fall on an (M – 1)-dim data hyperplane. Mathematically, the data hyperplane is represented by its normal vector p such that Φ^T p = 1. The optimal decision hyperplane in H (represented by w) must be orthogonal to the data hyperplane:

w^T p = 0 ⇒ a^T Φ^T p = 0 ⇒ a^T 1 = 0.
Kernel Fisher discriminant analysis (KFDA)
The objective of KFDA [40] is to determine an optimal discriminant function (linearly) expressed in the Hilbert space H:

f(x) = w^T φ(x) + b,

where b is a bias to account for the fact that the training data may not be mass-centered. The discriminant function may be equivalently expressed in the N-dim spectral space E:

f(x) = v^T e(x) + b.
The finite-dimensional space E facilitates our analysis and design of optimal classifiers. In fact, the optimal projection vector v_opt in E can be obtained by applying conventional FDA to the column vectors of E. To derive the objective function of KFDA, let us define the class-indicator vectors, where 1_+ and 1_– contain 1's in the entries corresponding to Classes C_+ and C_–, respectively, and 0's otherwise; and N_+ and N_– are the numbers of training samples in Classes C_+ and C_–, respectively. It can be shown that the objective function of KFDA is:
where 1 is an N-dim vector with all elements equal to 1, and the between-class and within-class covariance matrices are computed in the E space.
Perturbational discriminant analysis (PDA)
The FDA and KFDA are based on the assumption that the observed data are perfectly measured. It is however crucial to take into account the inevitable perturbation of training data. For the purpose of designing practical classifiers, we can adopt the following perturbational discriminant analysis (PDA).
It is assumed that the observed data are contaminated by additive white noise in the spectral space. Denote the center-adjusted matrix of E as Ē and the uncorrelated noise as N; then the perturbed scatter matrix is
where ρ is a parameter representing the noise level. Its value can sometimes be estimated empirically if the domain knowledge is well established a priori. Under the perturbation analysis, the kernel Fisher score in Eq. 4 is modified to the following perturbed variant:
By taking the derivative of J_PDA(v) with respect to v, the optimal solution to Eq. 5 can be obtained as:
and using the Sherman-Morrison-Woodbury identity, it can be shown that [41]
where η is a scalar whose value can be determined through the optimal solution in K space as follows.
Recall from Eq. 2 that dot-products in the three spaces are equivalent. Therefore, the discriminant function in the K space can be written as:
Given the optimal solution v_opt in the E space, the corresponding optimal solution in the K space is
where we have used K = U^T Λ U and E = Λ^{1/2} U. Note that unlike Eq. 6, Eq. 8 does not require spectral decomposition, thus offering a fast closed-form solution. Now, using the orthogonal hyperplane principle, we have
Eq. 6 suggests that ρ has a stronger regularization effect on the minor components (those with small eigenvalues) than on the major components (those with large eigenvalues), which serves the purpose of regularization well: a PDA classifier uses a smaller proportion of the minor (and risky) components and more of the major components. Therefore, the parameter ρ plays two major roles: (1) it assures the Mercer condition and the invertibility of the kernel matrix; and (2) it suppresses the weights assigned to the riskier and less resilient components.
The remaining unknown is the bias b. Recall from Eq. 2 that dot-products in the three spaces are equivalent. Therefore, the discriminant function in the K space can be written as:
Putting all training data x _{ i } into Eq. 10, we have
where y_i = 1 when x_i ∈ C_+ and y_i = –1 when x_i ∈ C_–. Since K is invertible, we have a_opt = K^{–1}(y – b1). Eqs. 6 and 8 suggest that perturbation in the spectral space can be represented by shifting the diagonal of K by ρ. Therefore, taking the perturbation in the spectral space into account, we have
a _{opt} = (K + ρ I)^{–1} (y –b 1). (11)
Note that the solutions given in Eq. 8 and Eq. 11 are equivalent. Now, b can be determined by using the orthogonal hyperplane principle to maximize the inter-class separability:
Note that the solutions for a and b in Eqs. 11 and 12 are equivalent to those of the least-squares SVM [42], although the derivations are different.
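As a minimal numerical sketch of the PDA solution in Eq. 11 (assuming the bias b from Eq. 12 is already given), training reduces to one regularized linear solve against the alignment-score matrix:

```python
import numpy as np

def pda_train(K, y, b, rho):
    """Closed-form PDA weights (Eq. 11): a = (K + rho*I)^{-1} (y - b*1).
    K is the N x N profile-alignment score matrix, y the +/-1 labels,
    b the bias (here assumed given), rho the perturbation parameter."""
    N = K.shape[0]
    return np.linalg.solve(K + rho * np.eye(N), y - b * np.ones(N))

def pda_score(a, b, k_x):
    """Discriminant function in the empirical space: f(x) = a^T k(x) + b."""
    return a @ k_x + b
```

This single solve, versus the iterative quadratic programming of an SVM, is why Table 6 reports a much shorter training time for PDA.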
References
 1.
von Heijne G: A new method for predicting signal sequence cleavage sites. Nucleic Acids Research 1986,14(11):4683–4690. 10.1093/nar/14.11.4683
 2.
Nakai K, Kanehisa M: Expert system for predicting protein localization sites in gram-negative bacteria. Proteins: Structure, Function, and Genetics 1991,11(2):95–110. 10.1002/prot.340110203
 3.
Horton P, Park KJ, Obayashi T, Nakai K: Protein Subcellular Localization Prediction with WoLF PSORT. Proc. 4th Annual Asia Pacific Bioinformatics Conference (APBC06) 2006, 39–48.
 4.
Horton P, Park K, Obayashi T, Fujita N, Harada H, AdamsCollier C, Nakai K: WoLF PSORT: protein localization predictor. Nucleic acids research 2007,35(Web Server issue):585–587.
 5.
Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their Nterminal amino acid sequence. J. Mol. Biol. 2000,300(4):1005–1016. 10.1006/jmbi.2000.3903
 6.
Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP, and related tools. Nature Protocols 2007,2(4):953–971. 10.1038/nprot.2007.131
 7.
Hua SJ, Sun ZR: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17: 721–728. 10.1093/bioinformatics/17.8.721
 8.
Huang Y, Li YD: Prediction of protein subcellular locations using fuzzy KNN method. Bioinformatics 2004, 20: 21–28. 10.1093/bioinformatics/btg366
 9.
Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003,19(13):1656–1663. 10.1093/bioinformatics/btg222
 10.
Nakashima H, Nishikawa K: Discrimination of intracellular and extracellular proteins using amino acid composition and residuepair frequencies. J. Mol. Biol. 1994, 238: 54–61. 10.1006/jmbi.1994.1267
 11.
Mott R, Schultz J, Bork P, Ponting C: Predicting protein cellular localization using a domain projection method. Genome Research 2002,12(8):1168–1174. 10.1101/gr.96802
 12.
Scott M, Thomas D, Hallett M: Predicting subcellular localization via protein motif co-occurrence. Genome Research 2004,14(10a):1957–1966. 10.1101/gr.2650004
 13.
Mak MW, Guo J, Kung SY: PairProSVM: Protein Subcellular Localization Based on Local Pairwise Profile Alignment and SVM. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2008,5(3):416–422.
 14.
Nair R, Rost B: Inferring subcellular localization through automated lexical analysis. Bioinformatics 2002, 18(Suppl 1): S78–S86. 10.1093/bioinformatics/18.suppl_1.S78
 15.
Chou K, Shen H: Recent progress in protein subcellular location prediction. Analytical Biochemistry 2007, 370: 1–16. 10.1016/j.ab.2007.07.006
 16.
Baldi P, Brunak S: Bioinformatics: The Machine Learning Approach. 2nd edition. MIT Press; 2001.
 17.
Nielsen H, Engelbrecht J, Brunak S, von Heijne G: A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural Sys. 1997, 8: 581–599. 10.1142/S0129065797000537
 18.
Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 1997, 10: 1–6. 10.1093/protein/10.1.1
 19.
Xu Q, Hu DH, Xue H, Yu W, Yang Q: Semi-supervised protein subcellular localization. BMC Bioinformatics 2009, 10.
 20.
Yuan Z: Prediction of protein subcellular locations using Markov chain models. FEBS Letters 1999, 451: 23–26. 10.1016/S0014-5793(99)00506-2
 21.
Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Structure, Function, and Genetics 2001, 43: 246–255. 10.1002/prot.1035
 22.
Nair R, Rost B: Sequence conserved for subcellular localization. Protein Science 2002, 11: 2836–2847.
 23.
Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R: Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 2004,20(4):547–556. 10.1093/bioinformatics/btg447
 24.
Kim JK, Raghava GPS, Bang SY, Choi S: Prediction of subcellular localization of proteins using pairwise sequence alignment and support vector machine. Pattern Recog. Lett. 2006,27(9):996–1001. 10.1016/j.patrec.2005.11.014
 25.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
 26.
Wang W, Mak MW, Kung SY: Speeding up Subcellular Localization by Extracting Informative Regions of Protein Sequences for Profile Alignment. In Proc. Computational Intelligence in Bioinformatics and Computational Biology. Montreal; 2010:147–154.
 27.
Mak MW, Kung SY: Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites. In Proc. ICASSP. Taipei; 2009:1605–1608.
 28.
 29.
Lafferty J, McCallum A, Pereira F: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th Int. Conf. on Machine Learning 2001.
 30.
von Heijne G: Patterns of amino acids near signal-sequence cleavage sites. Eur J Biochem 1983, 133: 17–21. 10.1111/j.1432-1033.1983.tb07424.x
 31.
Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 2004, 340: 783–795. 10.1016/j.jmb.2004.05.028
 32.
Emanuelsson O, Nielsen H, von Heijne G: ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science 1999, 8: 978–984. 10.1110/ps.8.5.978
 33.
Nielsen H, Brunak S, von Heijne G: Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng 1999, 12: 3–9. 10.1093/protein/12.1.3
 34.
[http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html]
 35.
Menne KML, Hermjakob H, Apweiler R: A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics 2000, 16: 741–742. 10.1093/bioinformatics/16.8.741
 36.
Kung SY: Kernel Approaches to Unsupervised and Supervised Machine Learning. In Proc. PCM, LNCS 5879. Edited by: Muneesawang P. Springer-Verlag; 2009:1–32.
 37.
Vapnik VN: Statistical Learning Theory. New York: Wiley; 1998.
 38.
Matthews BW: Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta. 1975, 405: 442–451.
 39.
Tsuda K: Support vector classifier with asymmetric kernel functions. In Proc. ESANN. Bruges, Belgium; 1999:183–188.
 40.
Mika S, Rätsch G, Weston J, Schölkopf B, Müller KR: Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX. Edited by: Hu YH, Larsen J, Wilson E, Douglas S. 1999, 41–48.
 41.
Kung S, Mak M: PDA-SVM Hybrid: A Unified Model For Kernel-Based Supervised Classification. Journal of Signal Processing Systems for Signal, Image, and Video Technology 2011. To appear
 42.
Suykens JAK, Vandewalle J: Least squares support vector machine classifiers. Neural processing letters 1999,9(3):293–300. 10.1023/A:1018628609742
 43.
Wu CH, McLarty JM: Neural Networks and Genome Informatics. Elsevier Science; 2000.
Acknowledgements
This work was supported in part by The Hong Kong Polytechnic University (GU877) and the Research Grants Council of the Hong Kong SAR (PolyU 5264/09E). This work is based on our presentation "Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization" at IEEE BIBM'2010, Hong Kong.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
Author information
Additional information
Authors' contributions
M.W. Mak and W. Wang contributed to (1) the idea of cascaded fusion of signal-based and homology-based methods, (2) preparation of data, (3) implementation of the CRF cleavage site predictor and the SVM/PDA classifiers, and (4) experimental evaluations. S.Y. Kung contributed to (1) the theoretical development and derivation of PDA and (2) the idea of cascaded fusion and sensitivity analysis.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Mak, M., Wang, W. & Kung, S. Fast subcellular localization by cascaded fusion of signal-based and homology-based methods. Proteome Sci 9, S8 (2011) doi:10.1186/1477-5956-9-S1-S8
Published
DOI
Keywords
 Subcellular Localization
 Cleavage Site
 Spectral Space
 Sorting Signal
 Subcellular Localization Prediction