Detecting overlapping protein complexes in PPI networks based on robustness
© Wang and Wu et al; licensee BioMed Central Ltd. 2013
Published: 7 November 2013
Recently, large data sets of protein-protein interactions (PPI) which can be modeled as PPI networks are generated through high-throughput methods. And locally dense regions in PPI networks are very likely to be protein complexes. Since protein complexes play a key role in many biological processes, detecting protein complexes in PPI networks is one of important tasks in post-genomic era. However, PPI networks are often incomplete and noisy, which builds barriers to mining protein complexes.
We propose a new and effective algorithm based on robustness to detect overlapping clusters as protein complexes in PPI networks. And in order to improve the accuracy of resulting clusters, our algorithm tries to reduce bad effects brought by noise in PPI networks. And in our algorithm, each new cluster begins from a seed and is expanded through adding qualified nodes from the cluster's neighbourhood nodes. Besides, in our algorithm, a new distance measurement method between a cluster K and a node in the neighbours of K is proposed as well. The performance of our algorithm is evaluated by applying it on two PPI networks which are Gavin network and Database of Interacting Proteins (DIP). The results show that our algorithm is better than Markov clustering algorithm (MCL), Clique Percolation method (CPM) and core-attachment based method (CoAch) in terms of F-measure, co-localization and Gene Ontology (GO) semantic similarity.
Our algorithm detects locally dense regions or clusters as protein complexes. The results show that protein complexes generated by our algorithm have better quality than those generated by some previous classic methods. Therefore, our algorithm is effective and useful.
Nowadays, with the rapid development of advanced proteomics technologies such as mass spectrometry , microarrays  and phage display , large data sets of PPIs are generated. And a common way to analyze these data sets is to portray them as PPI networks. In the post-genomic era, understanding PPI networks is an important task. It is suggested that locally dense regions in PPI networks are likely to be protein complexes. Protein complexes involve genes or proteins which participate in common fundamental biological processes . Since protein complexes play a vital role in the cellular organization and function, detecting protein complexes in PPI networks becomes a burning issue which attracts many researchers.
In order to solve the problem of detecting protein complexes in PPI networks, many researchers have proposed a lot of computational approaches. The method proposed by Spirin and Mirny  in 2003 tried to find all fully connected subgraphs (cliques) as protein complexes by complete enumeration. However, identifying all cliques in PPI networks is an NP-complete problem and clique is a very restrictive condition which leaves out many potential protein complexes. In order to improve this method, CPM  mined k-clique percolation clusters as protein complexes. Although CPM can detect more protein complexes than , it still omits a lot of protein complexes because many protein complexes are not k-clique percolation clusters. And Molecular Complex Detection (MCODE)  proposed by Bader and Hogue was a kind of density-based methods. Based on node weight, MCODE detects locally dense regions as protein complexes. In MCODE algorithm, node weight is assigned to each node in the network according to its local neighbourhood density. In general, MCODE is an effective approach, but it detects only a few protein complexes in PPI networks. Besides, Highly Connected Subgraph method (HCS) , one of graph partition-based methods, used minimum cuts to separate a PPI network into several subgraphs. The resulting subgraphs are regarded as protein complexes as long as they satisfy a specified density threshold. And MCL , another kind of graph partition-based methods, produced highly connected subnetworks as protein complexes through random walks. Though HCS and MCL are fast, they do not support mining overlapping protein complexes. However, in reality, many protein complexes do share proteins. In addition, Restricted Neighborhood Search Clustering Algorithm (RNSC)  which was a cost-based local search algorithm partitioned PPI networks through minimizing a cost function. Since RNSC is a randomized algorithm, the same input data will make RNSC produce different results. And in order to improve the accuracy of clustering results, RNSC uses a filtering step. This step needs information of functional homogeneity, which is a shortcoming of RNSC. And recently, CoAch  has been proposed according to the idea in  that a protein complex generally includes a core and several attachments. It first detects cores from each nodes' neighbourhood graph and then expands each core by adding nodes which connect more than half of nodes in the core.
Kelly et al.  found that many alternative paths often exist between functionally associated proteins since they contribute a lot to the robustness of biological signals' transmission. Therefore, protein complexes should have the characteristic of robustness. As an expansion upon the proceedings version , we propose a novel algorithm to mine overlapping protein complexes on the basis of robustness. In our algorithm, a seed is firstly selected as a new cluster and expands the cluster by adding qualified nodes from the cluster's neighbourhood nodes. Since many existing PPI networks are incomplete and noisy, each edge's weight in resulting clusters should satisfy a specified threshold. The weight of an edge in PPI network is assigned on the basis of the similarity of nodes at the end of the edge. Besides, resulting clusters should have relatively high robustness for trying to find out reliable protein complexes. Moreover, we present a new distance measurement method between a cluster K and a node in the neighbours of K as well. In order to estimate the performance, our algorithm is compared with MCL, CPM and CoAch. The results show our algorithm has an improvement in terms of F-measure, co-localization and GO semantic similarity.
A PPI network can be modeled as an undirected simple graph G=(V, E), in which V represents the set of nodes (proteins) and E represents the set of edges (protein interactions) in the network. Here, self-interactions (loops) and multiple edges between the same pair of nodes are not considered. Before detail description of our algorithm, some terminologies used in the following algorithm section are presented as follows.
where, Γ(u) and Γ(v) are neighbours of u and v respectively. Γ(u) ∩ Γ(v) represents all common neighbours of u and v, and Γ(u) ∪ Γ(v) represents all distinct neighbours of u and v. In our algorithm, edge weight is used to guarantee that in the same cluster every pair of nodes with an edge between them should have relatively high similarity.
The neighbourhood graph  of consists of v, all its neighbours and the edges among them. It is defined as in which and
where, V" is the set of nodes in is the number of nodes in . And deg(u) is the degree of a node u ∈ V" in . In our algorithm, the weight w v of a node v ∈ V is used in the step of seed chosen. If w v is big enough, v has many neighbours in and is a densely connected region.
where, | K | is the number of nodes in K . m is the number of all edges between v and any node in K . deg(K) is the total external degree of K . OUT(K) is the number of edges connecting any node in K with nodes outside K which do not include v and common neighbours of K and v . And OUT(v) is the number of all edges connecting v with nodes which do not include all nodes in K and common neighbours of K and v . Equation (3) defines the function distance of K and v according to both direct connections and common neighbours of K and considers the influence of direct connections between K and v on the function distance. considers the influence of common neighbours of K and v on the function distance. And | K | is an adjustment factor.
Where, n is the number of nodes removed until the remaining graph of K is not connected or contains no nodes. And |K| is the number of all nodes in K before removing operation begins.
Our algorithm includes five main steps which are Input & initialization, Termination, Seed chosen, Cluster formation and Post processing. Each step will be discussed below.
Input & initialization
The input to our algorithm includes an undirected simple PPI network and two parameters which are the value θ of minimum robustness degree allowed for generated clusters, and the value γ of minimum edge weight allowed in clusters.
In the step of initialization, all edges in G whose edge weights are 0 are deleted since they often have low reliability. In addition, set Visited v = false for every node in the PPI network. The Visited v attribute of a node v records whether v has been assigned to any generated cluster or not.
Our algorithm is terminated when following circumstances appear: (1) all nodes in the network are assigned to generated clusters, and (2) the highest node weight in all node weights of nodes which are not assigned to any cluster is 0, which means all nodes which do not belong to any cluster are isolated. Then, generated clusters which have no less than 3 nodes are printed out.
Each cluster starts at a node which is called seed. In the area of detecting protein complexes in PPI networks, seed should have following characteristics. Direct neighbours of seed should be as many as possible and neighbourhood graph of seed should be as densely connected as possible.
In our algorithm, the measurement method of node weight in (2) satisfies requirements of seed selection. Therefore, (2) is used in this step. In G, the node with the highest node weight among all nodes which are not assigned to any generated cluster is selected as seed. And the Visited v attribute of seed is set to true.
A cluster K starts at seed and then grows gradually by adding nodes one by one from neighbours of K which are included in the set N K . At each time, the node with the highest priority in N K is taken into consideration. The priority depends on the functional distance d (v, K) between a node v and a cluster K in Equation (3). The smaller d(v, K) is, the higher priority v has.
Before adding the node with the highest priority to cluster K, two conditions must be checked. First, we make sure that the node's addition in K does not cause the robustness degree of K to fall below θ which is the minimum robustness degree allowed for generated clusters. Second, we guarantee that the node's addition in K does not lead the minimum edge weight in K to fall below γ which is the allowed minimum edge weight in generated clusters. These two conditions guarantee that resulting clusters are densely connected. If any of the above conditions is not satisfied, we delete the node in N K . Then, we continue to check whether the node with the highest priority in the updated N K satisfies the two conditions or not.
Assume that CK is the set of all currently produced clusters and K is a newly detected cluster. Before K is added in CK , neighbour affinities between K and any cluster in CK are calculated and the cluster K M which has the largest neighbour affinity with K in CK is discovered. If the neighbour affinity NA (K, K M ) is greater than or equal to 0.5, K M and K will be merged. Otherwise, K will be added in CK as an independent cluster. This step not only allows generated clusters share common nodes but also prevents generated clusters from having too high overlapping degrees with each other.
In this paper, Gavin  dataset and DIP  which are produced through high-throughput technology are used to validate our algorithm. In Gavin dataset, there are 1430 proteins and 6532 interactions. And in DIP dataset, there are 4,928 proteins and 17,201 interactions.
In order to evaluate the quality of predicted protein complexes, a benchmark set presented in  is obtained form MIPS , Aloy et al.  and Saccharomyces Genome Database (SGD)  according to GO notations. In this benchmark set, 428 protein complexes are included.
Precision, Recall and F-measure
Assume p is a predicted protein complex and b is a real protein complex in the benchmark. Whether p and b match to each other or not is determined by the neighbour affinity NA( p, b). If NA (p, b). ≥ w, p and b are thought to be matched. Usually, it's a common way to set w to 0.2.
Obviously, the higher co-localization score is, the higher functional similarity between proteins in the same complex is. Therefore, co-localization score is a good way to evaluate the quality of predicted protein complexes.
GO semantic similarity
Comparing GO terms associated with proteins in a protein complex is another indicator of biological relevance. According to GO information, a novel method to measure GO semantic similarity is proposed in . In , for a protein complex, GO semantic similarity score is defined as the average of semantic similarity scores of all protein pairs within the protein complex. And for a set of protein complexes, GO semantic similarity score is defined as the geometric mean of scores of three ontologies which are "cellular component", "biological process" and "molecular function" ontology respectively. These three ontologies scores are obtained from GO semantic similarity scores of all complexes in the set. Therefore, the higher GO semantic similarity score of a set of protein complexes is, the better the quality of the set of predicted protein complexes is.
Complex set comparative evaluation
In this section, the above three categories of evaluation criteria are used to compare the performance of our algorithm with three classic methods which are MCL, CPM and CoAch.
Precision, Recall and F-measure Comparison
The comparison of various algorithms using Gavin dataset.
The comparison of various algorithms using DIP dataset
From Table 1, it is obvious that our algorithm has the highest precision and the third highest recall among the four algorithms. And in Table 1, F-measure of our algorithm is the highest. Similarly, in Table 2, it can be found that our algorithm achieves the highest precision and the third highest recall among the four algorithms. Also, F-measure of our algorithm in Table 2 is the highest.
Recall of our algorithm is not very high because the number of protein complexes predicted by our algorithm is not large. And since MCL and CoAch predict large number of protein complexes, they achieve higher recall than our algorithm. The reason that our algorithm predicts a limited number of protein complexes is that our algorithm tries to use robustness degree and edge weight to find reliable protein complexes in incomplete and noisy PPI networks. Thus, protein complexes predicted by our algorithm have very high precision at the cost of not high recall. Although recall is not high, our algorithm still obtains the highest F-measure on both Gavin and DIP dataset. And as F-measure takes both precision and recall into consideration, it can be found that protein complexes detected by our algorithm have good quality.
The localization dataset published by Huh et al.  is used in the calculation of co-localization score. It is often used in evaluating the quality of protein complexes. And in this dataset, 75% of yeast proteome are classified into 22 distinct sub-cellular locations. In addition, ProCope , a popular tool which offers easy access to different ways to predict and evaluate protein complexes, is used to calculate the co-localization score.
Since proteins within a complex tend to be at the same sub-cellular locations, co-localization score can be used to evaluate the biological relevance of predicted protein complexes. And high co-localization score means high biological relevance of predicted protein complexes. Thus, as our algorithm achieve highest scores on both Gavin and DIP dataset, it is obvious that protein complexes detected by our algorithm have relatively high quality from the biological view.
GO semantic similarity
The similarity of two different GO terms is determined by their most recent common ancestors in the ontology structure. And the similarity of GO terms related to proteins in a protein complex can demonstrate the biological relevance within the complex. In this section, we also use GO semantic similarity score which can show the similarity of proteins within a complex to evaluate the quality of predicted protein complexes. We still make use of ProCope to calculate the GO semantic similarity score.
As high GO semantic score means high biological relevance of predicted complexes, our algorithm which achieves the highest GO semantic scores on both Gavin and DIP dataset can detect protein complexes with relatively good quality.
In this paper, a novel algorithm based on robustness to detect overlapping protein complexes is proposed. First, we present a method to measure the functional distance between a cluster and a node in the neighbours of the cluster. Then, we explain our algorithm in details. Our algorithm is a dense-based approach and tries to find reliable protein complexes in PPI networks. Finally, three categories of evaluation criteria are used to compare the performance of our algorithm with MCL, CPM and CoAch. The results show that our algorithm is better than MCL, CPM and CoAch in terms of F-measure, co-localization and GO semantic similarity, which means our algorithm can discover protein complexes with good quality. Therefore, our algorithm is effective and can be helpful in the future biological study.
This project is supported by the National Natural Science Fund of China (61173061, 71201120), and Specialized Research Fund for the Doctoral Program of Higher Education (20121101110036).
The publication costs for this article were funded by the corresponding author.
This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141–147. 10.1038/415141aPubMedView ArticleGoogle Scholar
- Stoll D, Templin MF, Bachmann J, Joos TO: Protein microarrays: applications and future challenges. Curr Opin Drug Discov Devel 2005, 8: 239–252.PubMedGoogle Scholar
- Willats WG: Phage display: practicalities and prospects. Plant Mol Biol 2002, 50: 837–854. 10.1023/A:1021215516430PubMedView ArticleGoogle Scholar
- Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular cell biology. Nature 1999, 402: C47-C52. 10.1038/35011540PubMedView ArticleGoogle Scholar
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 2003, 100: 12123–12128. 10.1073/pnas.2032324100PubMed CentralPubMedView ArticleGoogle Scholar
- Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435: 814–818. 10.1038/nature03607PubMedView ArticleGoogle Scholar
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4: 2. 10.1186/1471-2105-4-2PubMed CentralPubMedView ArticleGoogle Scholar
- Hartuv E, Shamir : A clustering algorithm based on graph connectivity. Information Processing Letters 2000, 76: 175–181. 10.1016/S0020-0190(00)00142-3View ArticleGoogle Scholar
- van Dongen S: Graph Clustering by Flow Simulation. University of Utrecht, Physics; 2000.Google Scholar
- King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20: 3013–3020. 10.1093/bioinformatics/bth351PubMedView ArticleGoogle Scholar
- Wu M, Li X, Kwoh C-K, Ng S-K: A Core-Attachment based Method to Detect Protein Complexes in PPI Networks. BMC Bioinformatics 2009, 10: 169. 10.1186/1471-2105-10-169PubMed CentralPubMedView ArticleGoogle Scholar
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. 10.1038/nature04532PubMedView ArticleGoogle Scholar
- Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 2005,23(5):561–566. 10.1038/nbt1096PubMed CentralPubMedView ArticleGoogle Scholar
- Wang SL, Li Y, Tu WC, Wang P: Automatic analysis method of protein expression images based on generalized data field. 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), IEEE Computer Society 2012, 1–4.Google Scholar
- Saito R, Suzuki H, Hayashizaki Y: Interaction generality, a measurement to access the reliability of a protein-protein interaction. Nucleic Acids Res 2002,30(5):68. I 63-IView ArticleGoogle Scholar
- Saito R, Suzuki H, Hayashizaki Y: Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics 2003,19(6):756–763. 10.1093/bioinformatics/btg070PubMedView ArticleGoogle Scholar
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006,22(13):1623–1630. 10.1093/bioinformatics/btl145PubMedView ArticleGoogle Scholar
- Brun C, Chevenet F, Martin D, et al.: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 2003,5(1):R6. 10.1186/gb-2003-5-1-r6PubMed CentralPubMedView ArticleGoogle Scholar
- Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic Acids Research 2000, 28: 289–291. 10.1093/nar/28.1.289PubMed CentralPubMedView ArticleGoogle Scholar
- Friedel CC, Krumsiek J, Zimmer R: Boostrapping the interactome: unsupervised identification of protein complexes in yeast. In 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB), of LNCS Volume 4955. Edited by: Vingron M, Wong L.. Springer; 2008:3–16.Google Scholar
- Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J, Ruepp A: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Research 2004,32(Suppl 1):D41-D44.PubMed CentralPubMedView ArticleGoogle Scholar
- Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, RB R: Structure-based assembly of protein complexes in yeast. Science 2004,303(5666):2026–2029. 10.1126/science.1092645PubMedView ArticleGoogle Scholar
- Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, Setheuraman A, Weng S, Botstein D, Cherry JM: Saccharomyces Genome Database provides secondary gene annotation using the Gene Ontology. Nucleic Acids Research 2002, 30: 69–72. 10.1093/nar/30.1.69PubMed CentralPubMedView ArticleGoogle Scholar
- Schlicker A, Domingues F, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 2006, 7: 302. 10.1186/1471-2105-7-302PubMed CentralPubMedView ArticleGoogle Scholar
- Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O' Shea EK: Global analysis of protein localization in budding yeast. Nature 2003, 425: 686–691. 10.1038/nature02026PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.