- Open Access
Protein complex detection with semi-supervised learning in protein interaction networks
© Shi et al; licensee BioMed Central Ltd. 2011
- Published: 14 October 2011
Protein-protein interactions (PPIs) play fundamental roles in nearly all biological processes. Systematic analysis of PPI networks can yield a deeper understanding of cellular organization, processes and function. In this paper, we investigate the problem of detecting protein complexes from noisy protein interaction data, i.e., finding the subsets of proteins that are closely coupled via protein interactions. However, protein complexes are likely to overlap and the interaction data are very noisy, so it is a great challenge to analyze the massive data effectively and detect biologically meaningful protein complexes.
Many researchers have tried to solve the problem with traditional unsupervised graph clustering methods. Here, we take a different point of view: we redefine the properties and features of protein complexes and design a “semi-supervised” method to address the problem. We use a neural network with a “semi-supervised” mechanism to detect the protein complexes. By retraining the neural network model recursively, we find optimized parameters for the model, which in turn allows us to detect the protein complexes. The comparison results show that our algorithm can identify protein complexes that are missed by other methods. We also show that our method achieves better precision and recall rates for the identified protein complexes than other existing methods. In addition, the proposed framework is easy to extend in the future.
Using a weighted network to represent the protein interaction network is more appropriate than using a traditional unweighted network. In addition, integrating biological features and topological features to represent protein complexes is more meaningful than using dense subgraphs. Last, the “semi-supervised” learning model is a promising model to detect protein complexes with more biological and topological features available.
- Protein Complex
- Protein Interaction Network
- Seed Node
- Protein Interaction Data
- Dense Subgraph
High-throughput assay methodologies, such as microarrays and mass spectrometry, have resulted in the rapid growth of protein data sets, the analysis of which can potentially yield insights into the mechanisms of human diseases and the discovery of new therapeutic interventions. Systematic analysis of the underlying relationships in these protein data sets can potentially provide useful insights into the roles of proteins in biological processes.
PPI data sets give us a good opportunity to analyze the structure of a large living system systematically, and to use it to understand essential principles such as essentiality, genetic interactions, functions, functional modules, protein complexes and cellular pathways. Cellular functions and biochemical events are carried out in a coordinated way by groups of proteins interacting with each other in functional modules, and the modular structure of complex networks is critical to their function. Identifying such protein complexes in PPI networks is very important for understanding the structure and function of these fundamental cellular networks. Developing an effective computational approach to identify those protein complexes is therefore challenging but indispensable.
However, protein complexes are likely to overlap and the interaction data are very noisy. It is a great challenge to analyze the massive data effectively and detect biologically meaningful protein complexes. Since most proteins form macromolecular complexes involving two or more proteins to perform biological functions, many researchers assume that protein complexes are dense subgraphs, so graph clustering algorithms can be applied to the problem. Molecular Complex Detection (MCODE) was the first computational method to detect protein complexes from PPI networks. MCODE first identifies densely connected subgraphs and then applies a post-processing step to filter out non-dense subgraphs and generate overlapping clusters. Later, Spirin and Mirny proposed a clique-based algorithm, which exhaustively searches the network for full cliques and reports them as protein complexes. Since requiring full cliques is too constraining, they also applied Super-Paramagnetic Clustering (SPC) and a Monte Carlo (MC) simulation for the same purpose. Instead of adopting the over-constraining full cliques as the basis for protein complexes, Li et al. devised the Local Clique Merge Algorithm (LCMA), which merges local cliques in an attempt to address the incompleteness of current protein interaction data. Altaf-Ul-Amin et al. proposed a cluster periphery-tracking algorithm (DPClus) that detects protein complexes by keeping track of the periphery of a detected cluster. Chua et al. [13] proposed an algorithm called PCP (ProteinComplexPrediction), which combines a PPI network filtered by FS-weight with clique finding and merging techniques. Ucar et al. developed a refinement method that uses a hub protein duplication strategy to detect dense subgraphs in scale-free PPI networks, with multi-functional hub proteins assigned to multiple clusters. Adamcsek et al. proposed the CFinder algorithm to find complexes in PPI networks. CFinder detects k-cliques as modules and then merges modules by calculating their similarities. Mete et al. extended the density-based clustering method DBSCAN and applied it to PPI networks. Their algorithm, SCAN, first forms a cluster around a core node and then iteratively merges the neighboring nodes one by one. Finally, the detected clusters are reported as the predicted protein complexes.
The previous methods share a serious problem: they all assume that protein complexes are dense subgraphs. As Qi et al. pointed out, not all protein complexes are clique-like, and many protein complexes have a star shape or other forms. In this paper, we approach the problem from another perspective, redefining the properties and features of protein complexes and using a semi-supervised learning method to build a model that detects the hidden protein complexes in scale-free PPI networks. First, we choose several biological and topological features to represent the protein complexes. Then, we use the “semi-supervised” mechanism to recursively train the neural network and obtain the optimized parameters for the model. Last, we use the neural network to detect the protein complexes in the protein interaction network.
The paper is organized as follows. First, we identify the difficulties of the problem. Second, we propose some favorable properties for protein complexes. Third, we propose the multi-layer neural network. Fourth, we conduct extensive experiments to verify the effectiveness of the proposed method. Finally, we conclude the paper and propose the future work.
Through extensive observation, we found that the following problems are key to detecting protein complexes in PPI networks.
Protein interaction data are very noisy. Since a clustering method is based on the protein-protein interactions in the graph, the more reliable those interactions are, the more accurate the clustering result will be. Previous work has shown that representing a PPI network with a weighted and filtered graph, instead of a traditional unweighted graph, is effective. The problem then becomes how to obtain reliable protein-protein interactions from PPI data. Here we use GO (Gene Ontology) to obtain the similarity between proteins in the network and build a weighted graph with a preset threshold.
Proteins may participate in multiple protein complexes. Therefore, protein complexes may overlap with each other. These overlaps correspond to proteins’ participation in multiple pathways and the crosstalk between different biological modules. Thus, the traditional clustering paradigm of putting each protein into one single cluster does not suit our problem well. Instead, we prefer a method that can detect subgraphs with possible overlaps. Our proposed semi-supervised method overcomes this drawback, from which many existing graph clustering methods suffer, and gives promising results.
How to represent protein complexes. Most existing clustering methods assume that protein complexes are dense subgraphs, which is not always true for the protein complexes in PPI networks. In addition, protein complexes exhibit many kinds of topologies, and the tremendous variation in their sizes poses a further problem for identifying specific topologies. Traditional methods were all unsupervised and did not fully utilize the properties and features of protein complexes. Here we use both topological and biological properties to represent protein complexes, and we propose a semi-supervised method based on a multi-layer neural network to detect the hidden protein complexes.
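The first point above, weighting interactions by GO similarity and filtering them with a threshold, can be sketched as follows. This is only a minimal illustration and not the procedure used in the paper: the similarity function (a Jaccard overlap of GO term sets), the helper names `go_similarity` and `build_weighted_network`, the toy annotations and the threshold value are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def go_similarity(terms_a: Set[str], terms_b: Set[str]) -> float:
    """Hypothetical stand-in for a GO-based similarity: Jaccard overlap of term sets."""
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def build_weighted_network(
    interactions: Iterable[Tuple[str, str]],
    annotations: Dict[str, Set[str]],
    threshold: float,
) -> Dict[FrozenSet[str], float]:
    """Weight each interaction by similarity and keep only edges at or above the threshold."""
    weighted: Dict[FrozenSet[str], float] = {}
    for a, b in interactions:
        w = go_similarity(annotations.get(a, set()), annotations.get(b, set()))
        if w >= threshold:
            weighted[frozenset((a, b))] = w
    return weighted


# Toy data (made up for illustration only).
toy_annotations = {
    "P1": {"GO:1", "GO:2"},
    "P2": {"GO:1", "GO:2", "GO:3"},
    "P3": {"GO:9"},
}
toy_network = build_weighted_network(
    [("P1", "P2"), ("P2", "P3")], toy_annotations, threshold=0.5
)
```

In this toy run the P1-P2 interaction is kept with weight 2/3, while P2-P3 shares no annotation and is filtered out.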
For our experiments, we built our weighted protein interaction networks from the DIP data set, which contains 4935 proteins and 14162 interactions. The way the weighted network is built is described in our previous paper. To evaluate the predicted complexes, a set of real complexes is selected as the benchmark. This benchmark set is from MIPS, and we select only those complexes that contain more than two proteins.
where A is a predicted complex, B is a true protein complex, $V_A$ is the set of proteins in subgraph A, and $V_B$ is the set of proteins in subgraph B. In this paper, we use an overlapping threshold of 0.20 to determine a match in all experiments. Predicted protein clusters that match one or more true protein complexes with overlapping scores higher than this threshold are identified as “matched clusters,” and the corresponding true complexes are noted as “matched complexes.”
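The matching rule can be illustrated with a short sketch. The overlap formula itself is not shown in this text, so the code assumes the form that is commonly used together with a 0.20 threshold, $|V_A \cap V_B|^2 / (|V_A| \cdot |V_B|)$; the function names and the toy protein sets are hypothetical.

```python
from typing import List, Set, Tuple


def overlap_score(v_a: Set[str], v_b: Set[str]) -> float:
    """ASSUMED form of the score: |V_A & V_B|^2 / (|V_A| * |V_B|)."""
    if not v_a or not v_b:
        return 0.0
    shared = len(v_a & v_b)
    return shared * shared / (len(v_a) * len(v_b))


def match(
    predicted: List[Set[str]], known: List[Set[str]], threshold: float = 0.20
) -> Tuple[Set[int], Set[int]]:
    """Return the indices of matched clusters and of matched complexes."""
    matched_clusters: Set[int] = set()
    matched_complexes: Set[int] = set()
    for i, a in enumerate(predicted):
        for j, b in enumerate(known):
            # A match requires a score strictly higher than the threshold.
            if overlap_score(a, b) > threshold:
                matched_clusters.add(i)
                matched_complexes.add(j)
    return matched_clusters, matched_complexes


# Toy sets (made up for illustration only).
toy_predicted = [{"A", "B", "C"}, {"X", "Y"}]
toy_known = [{"A", "B", "D"}, {"Q", "R", "S"}]
toy_matched_clusters, toy_matched_complexes = match(toy_predicted, toy_known)
```

Here the first predicted cluster shares two of three proteins with the first known complex, giving a score of 4/9 under the assumed formula, so both are counted as matched.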
where f-measure is defined as the harmonic mean of recall and precision. It reflects a combination of precision and recall.
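As a small worked example of the harmonic mean, the sketch below computes the f-measure from counts of matched clusters and matched complexes. The definitions of precision and recall are not shown in this text; the code assumes precision is the fraction of predicted clusters that are matched and recall is the fraction of true complexes that are matched. The function names and the counts are hypothetical.

```python
from typing import Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(
    n_matched_clusters: int,
    n_predicted: int,
    n_matched_complexes: int,
    n_known: int,
) -> Tuple[float, float, float]:
    """ASSUMED definitions: precision over predicted clusters, recall over known complexes."""
    precision = n_matched_clusters / n_predicted if n_predicted else 0.0
    recall = n_matched_complexes / n_known if n_known else 0.0
    return precision, recall, f_measure(precision, recall)


# Made-up counts: 30 of 60 predicted clusters matched, 20 of 80 known complexes matched.
toy_precision, toy_recall, toy_f = evaluate(30, 60, 20, 80)
```

With these made-up counts, precision is 0.5, recall is 0.25, and the f-measure is 1/3, which sits closer to the lower of the two values, as a harmonic mean does.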
Performance comparison of MCODE (Molecular Complex Detection), NN (Neural Network), SVM and BN (Bayesian Network). Table columns include: Supervised or Not.
In this paper, we analyzed and detected protein complexes in protein-protein interaction networks from a different perspective. Instead of using traditional unsupervised algorithms to find dense subgraphs in the PPI networks, we proposed a semi-supervised prediction model based on a neural network. Unlike previous methods, which rely too heavily on the density of the subgraph, our algorithm utilizes topological and biological features from known protein complexes. With these features, we can represent protein complexes better than the previous methods, and a more accurate prediction model can be built upon them. The comparison results show that our algorithm can identify complexes that are missed by other methods. We have also shown that our method achieves better precision and recall rates for the identified protein complexes. In addition, the proposed framework is easy to extend. Since obtaining the features of protein complexes and building the prediction model are independent steps, future work can add more representative features of protein complexes and adopt other prediction models similar to the neural network. As a next step, we hope to find more representative features, either topological or biological, to characterize protein complexes. Also, as PPI networks of more species become available, we can apply the proposed method to the newly emerging data sets.
While the existing methods identify protein complexes under strong assumptions about their topology (dense subgraphs), our proposed method utilizes multiple features that define protein complexes in protein-protein interaction networks. Instead of only assuming that protein complexes are dense subgraphs, we derive several properties from known protein complexes and use these features to search for new protein complexes. Our algorithm first learns the weights of the different features from the limited set of known protein complexes. It then assigns a score to any subgraph in the graph. With a preset threshold, we can label some of the subgraphs as complexes. With more complexes, we can train on the data again and obtain more suitable weights for the features, and thus a better prediction model. Repeating this recursively, we find the protein complexes in the PPI network. Compared with the existing methods, our proposed model found more accurate protein complexes in the protein-protein interaction network.
Weighted undirected PPI network
The percentage of function-relevant interactions in three protein interaction data sets. Table columns include: total number of interactions; number of function-relevant interactions.
The distribution of features. Table columns include: number of features. Surviving row labels: edge weight statistics; polarity of amino acids.
Graph density: graph density is defined as $2|E| / (|V|(|V| - 1))$, where |E| is the number of edges in the graph and |V| is the number of vertices in the graph.
Degree statistics: these features are calculated from the degrees of the vertices in the subgraph. The degree of a vertex is defined as its number of neighbors. Mean degree, variance of degrees, median degree and maximum degree are chosen as degree statistics.
Edge weight statistics: we only consider edges with nonzero weights here. Like degree statistics, mean and variance of weights are taken as features.
Topological change: this group of features is obtained by measuring the topological changes when different cutoffs of the weights are applied to the graph. Topological changes are measured as $T_i = (|E_i| - |E_{i+1}|) / |E_i|$, where $|E_i|$ is the number of edges remaining at cutoff i.
Clustering coefficient: the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. For a node v it is defined as $2|e(i, j)| / (d(v)(d(v) - 1))$, where |e(i, j)| gives the number of triangles that go through node v, whereas $d(v)(d(v) - 1)/2$ is the total number of triangles that could pass through node v.
Topological coefficient: the topological coefficient is a relative measure of the extent to which a protein shares interaction partners with other proteins.
Protein length: the number of amino acids in a protein sequence.
Polarity of amino acids: the longer and more complementary the binding sites of a protein, most of which are polar, the more strongly the proteins bind to each other.
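The topological features listed above can be sketched for a single candidate subgraph as follows. This is an illustrative sketch rather than the authors' implementation: the function name `topological_features`, the edge representation, the choice of population variance, the choice of cutoffs and the toy subgraph are all assumptions. The topological coefficient and the biological features (protein length, polarity) are omitted because they need data that is not described here.

```python
from itertools import combinations
from statistics import mean, median, pvariance
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def topological_features(
    nodes: Set[str], weights: Dict[Edge, float], cutoffs: List[float]
) -> Dict[str, object]:
    """Compute several of the listed features for one non-empty candidate subgraph."""
    # Keep only nonzero-weight edges with both endpoints inside the subgraph.
    edges = {
        frozenset(e): w
        for e, w in weights.items()
        if w > 0 and e[0] != e[1] and e[0] in nodes and e[1] in nodes
    }
    adj: Dict[str, Set[str]] = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    n, m = len(nodes), len(edges)
    degrees = [len(adj[v]) for v in nodes]

    def clustering(v: str) -> float:
        d = len(adj[v])
        if d < 2:
            return 0.0
        triangles = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        return 2 * triangles / (d * (d - 1))

    # Edge counts surviving each cutoff, then T_i = (|E_i| - |E_{i+1}|) / |E_i|.
    counts = [sum(1 for w in edges.values() if w >= c) for c in sorted(cutoffs)]
    changes = [(a - b) / a if a else 0.0 for a, b in zip(counts, counts[1:])]

    ws = list(edges.values())
    return {
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "degree_mean": mean(degrees),
        "degree_variance": pvariance(degrees),  # population variance (assumption)
        "degree_median": median(degrees),
        "degree_max": max(degrees),
        "weight_mean": mean(ws) if ws else 0.0,
        "weight_variance": pvariance(ws) if ws else 0.0,
        "topological_change": changes,
        "clustering_mean": mean([clustering(v) for v in nodes]),
    }


# Toy subgraph: a triangle A-B-C with a weakly attached protein D.
toy_features = topological_features(
    {"A", "B", "C", "D"},
    {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.7, ("C", "D"): 0.3},
    cutoffs=[0.0, 0.5, 0.75],
)
```

For the toy subgraph the density is 2/3 (four of six possible edges), and the topological change values are 1/4 and 1/3 as the cutoff rises from 0.0 to 0.5 and then to 0.75.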
A two-layer feed-forward neural network based model
The whole updating process terminates when all $\Delta w_{ij}$ become so small that they fall below a specified threshold.
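The stopping rule can be illustrated with a small, self-contained training loop. The model equations are not shown in this text, so this sketch assumes one sigmoid hidden layer and one sigmoid output unit trained by batch gradient descent on the squared error; the layer sizes, learning rate, tolerance and toy feature vectors are all hypothetical choices, not values from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], float]  # (feature vector, label 1.0 or 0.0)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(x: Sequence[float], w_hidden: List[List[float]], w_out: List[float]):
    """Return the hidden activations and the output for one feature vector."""
    xb = list(x) + [1.0]  # constant 1.0 acts as the bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, out


def train(samples: List[Sample], n_hidden: int = 3, rate: float = 0.5,
          tol: float = 1e-4, max_epochs: int = 5000, seed: int = 0):
    """Batch gradient descent that stops once every weight update is below tol."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(max_epochs):
        dw_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        dw_out = [0.0] * (n_hidden + 1)
        for x, target in samples:
            hidden, out = forward(x, w_hidden, w_out)
            delta_out = (target - out) * out * (1.0 - out)
            for j, h in enumerate(hidden + [1.0]):
                dw_out[j] += rate * delta_out * h
            for j, h in enumerate(hidden):
                delta_h = delta_out * w_out[j] * h * (1.0 - h)
                for i, v in enumerate(list(x) + [1.0]):
                    dw_hidden[j][i] += rate * delta_h * v
        # Apply the accumulated updates, then check the termination rule.
        for j in range(n_hidden + 1):
            w_out[j] += dw_out[j]
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hidden[j][i] += dw_hidden[j][i]
        largest = max(max(abs(d) for d in dw_out),
                      max(abs(d) for row in dw_hidden for d in row))
        if largest < tol:
            break
    return w_hidden, w_out


# Toy training set: two "complex-like" and two "non-complex-like" feature vectors.
toy_samples = [([0.9, 0.8], 1.0), ([0.8, 0.9], 1.0),
               ([0.1, 0.2], 0.0), ([0.2, 0.1], 0.0)]
toy_w_hidden, toy_w_out = train(toy_samples)
```

The `max_epochs` cap is a practical safeguard added for the sketch; the termination condition described in the text is the check on the largest weight update.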
A semi-supervised learning method for new complexes
We can use the above model to evaluate candidate subgraphs. If the evaluation score exceeds the threshold, the candidate subgraph is predicted to be a complex. The problem therefore becomes finding subgraphs with high evaluation scores in the weighted PPI network. However, identifying the set of maximally scoring subgraphs in a large graph has been proved to be NP-hard, so heuristic algorithms are needed. Several approaches have already been used to solve this problem, such as hill climbing, simulated annealing, and tabu search.
Input: weighted DIP PPI network and a training set of protein complexes and randomly generated non-protein complexes.
Learning parameters step:
- Extract features from the above two groups of complexes.
- Use the neural network to train the parameters for the prediction model.
Identifying complexes:
- Start from the seed nodes and add neighboring proteins of the cluster one by one, based on their priority and their impact on the cluster.
- Output the complexes when no more proteins satisfy the criterion given above.
- Use the newly generated complexes to recursively update the parameters of the model in the second step and find new complexes.
Output: predicted protein complexes.
Our input is the weighted PPI graph and a set of known complexes and non-complexes as training data. The known protein complexes are drawn from the MIPS protein complexes, and the non-complexes are generated randomly from the DIP protein interaction data set. First, we use the neural network model to learn the model parameters from the training data. Once we have the prediction model, we start searching for protein complexes. Then, as more protein complexes are found, we recursively retrain our prediction model and find new protein complexes until no more proteins can be added. The final output complexes are those detected clusters whose evaluation score is higher than the threshold.
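The search-and-retrain procedure described above can be sketched as a greedy (hill-climbing) expansion wrapped in a retraining loop. This is a sketch under stated assumptions, not the authors' implementation: the scoring function stands in for the trained neural network, and the names `grow_from_seed`, `detect_complexes` and `toy_train`, the stopping rule (stop when no neighbor raises the score), the round limit and the toy graph are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Dict[str, float]]   # node -> {neighbor: edge weight}
Scorer = Callable[[Set[str]], float]  # stand-in for the trained prediction model


def grow_from_seed(graph: Graph, seed: str, score: Scorer) -> Tuple[Set[str], float]:
    """Greedy expansion: keep adding the neighbor that raises the score the most."""
    cluster = {seed}
    best = score(cluster)
    while True:
        frontier = {n for v in cluster for n in graph[v]} - cluster
        candidates = [(score(cluster | {n}), n) for n in frontier]
        if not candidates:
            break
        new_score, node = max(candidates)
        if new_score <= best:  # assumed criterion: no neighbor improves the cluster
            break
        cluster.add(node)
        best = new_score
    return cluster, best


def detect_complexes(graph: Graph, seeds: Iterable[str], known: List[Set[str]],
                     train: Callable[[List[Set[str]]], Scorer],
                     threshold: float, max_rounds: int = 5) -> List[Set[str]]:
    """Alternate between training a scorer and searching, feeding new clusters back."""
    seeds = list(seeds)
    positives = [set(c) for c in known]
    found: List[Set[str]] = []
    for _ in range(max_rounds):
        score = train(positives)  # (re)train on all complexes collected so far
        found = []
        for s in seeds:
            cluster, value = grow_from_seed(graph, s, score)
            if value > threshold and cluster not in found:
                found.append(cluster)
        new = [c for c in found if c not in positives]
        if not new:  # nothing new to learn from, so stop
            break
        positives.extend(new)
    return found


# Toy graph: a strongly connected triangle A-B-C with a weakly attached D.
toy_graph: Graph = {
    "A": {"B": 0.9, "C": 0.8},
    "B": {"A": 0.9, "C": 0.9},
    "C": {"A": 0.8, "B": 0.9, "D": 0.2},
    "D": {"C": 0.2},
}


def toy_train(positives: List[Set[str]]) -> Scorer:
    """Hypothetical scorer: total internal edge weight divided by cluster size.
    It ignores the training data; a real model would be refit on `positives`."""
    def score(cluster: Set[str]) -> float:
        nodes = sorted(cluster)
        total = sum(toy_graph[u].get(v, 0.0)
                    for i, u in enumerate(nodes) for v in nodes[i + 1:])
        return total / len(nodes)
    return score


toy_complexes = detect_complexes(toy_graph, ["A"], [], toy_train, threshold=0.5)
```

Starting from seed A, the toy run adds B and then C, and stops before D because adding D lowers the score; the resulting cluster {A, B, C} passes the threshold and is reported.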
This work was partly supported by NSF grant DBI-0234895.
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
- Hwang W, Cho YR, Zhang A, Ramanathan M: A novel functional module detection algorithm for protein-protein interaction networks. Algorithms for Molecular Biology 2006, 1(24).
- Zhang A: Protein Interaction Networks: Computational Analysis. Cambridge University Press; 2009.
- Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular cell biology. Nature 1999, 402: c47-c52. doi:10.1038/35011540
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 2001, 98(8):4569–4574. doi:10.1073/pnas.061034498
- Aloy P, Russell RB: Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. USA 2002, 99(9):5896–5901. doi:10.1073/pnas.092147999
- Chen J, Yuan B: Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics 2006, 22(18):2283–2290. doi:10.1093/bioinformatics/btl370
- Hwang W, Kim T, Cho YR, Zhang A, Ramanathan M: SIGN: reliable protein interaction identification by integrating the similarity in GO and the similarity in protein interaction networks. Proceedings of 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE) 2007, 1384–1388.
- Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 2001, 18: 523–531. doi:10.1002/yea.706
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4(2).
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA 2003, 100(21):12123–12128. doi:10.1073/pnas.2032324100
- Li XL, et al.: Interaction graph mining for protein complexes using local clique merging. Genome Informatics 2005, 16(2):260–269.
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 2006, 7(207).
- Chua HN, Sung WK, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. CSB 2007, 97–109.
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22(13):1623–1630. doi:10.1093/bioinformatics/btl145
- Ucar D, Asur S, Catalyurek UV, Parthasarathy S: Improving Functional Modularity in Protein-Protein Interactions Graphs using Hub-Induced Subgraphs. PKDD 2006, 363: 28–42.
- Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 2006, 22(8):1021–1023. doi:10.1093/bioinformatics/btl039
- Mete M, Tang F, Xu X, Yuruk N: A structural approach for finding functional modules from large biological networks. BMC Bioinformatics 2008, 9(Suppl 9):SI9.
- Ester M, Kriegel HP, Sander J, Xu X: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996, 226–231.
- Qi Y, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z: Protein complex identification by supervised graph local clustering. Bioinformatics 2008, 24(13):i250–268. doi:10.1093/bioinformatics/btn164
- Shi L, Cho YR, Zhang A: ANN Based Protein Function Prediction Using Integrated Protein-Protein Interaction Data. Bioinformatics, Systems Biology and Intelligent Computing, International Joint Conference on 2009, 0: 271–277.
- Cho YR, Hwang W, Zhang A: Modularization of protein interaction networks by incorporating Gene Ontology annotations. Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2007, 233–238.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The database of interacting proteins: 2004 update. Nucleic Acids Research 2004, 32: D449-D451. doi:10.1093/nar/gkh086
- Pei P, Zhang A: A topological measurement for weighted protein interaction network. Proceedings of 16th IEEE Computational Systems Bioinformatics Conference (CSB) 2005, 268–278.
- Cho YR, Hwang W, Zhang A: Efficient modularization of weighted protein interaction networks using k-hop graph reduction. Proceedings of 6th IEEE Symposium on Bioinformatics and Bioengineering (BIBE) 2006, 289–298.
- Mewes HW, et al.: MIPS: analysis and annotation of proteins from whole genome in 2005. Nucleic Acids Research 2006, 34: D169-D172. doi:10.1093/nar/gkj148
- Joachims T: Learning to classify text using support vector machines. PhD thesis. Cornell University; 2001.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research 2006, 34: D535-D539. doi:10.1093/nar/gkj109
- Chen L, Shi X, Kong X, Zeng Z, Cai YD: Identifying protein complexes using hybrid properties. Journal of Proteome Research 2009, 8(11):5212–8. doi:10.1021/pr900554a
- Barabási AL, Oltvai ZN: Network biology: understanding the cell’s functional organization. Nature Reviews: Genetics 2004, 5: 101–113. doi:10.1038/nrg1272
- Han JW, Kamber M: Data Mining concepts and techniques. Morgan Kaufmann; 2006.
- Albert R, Jeong H, Barabási AL: Error and attack tolerance of complex networks. Nature 2000, 406: 378–482. doi:10.1038/35019019
- Virtanen SE: Properties of nonuniform random graph models. Research Report. Helsinki University of Technology, Laboratory for Theoretical Computer Science 2003.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.