Decomposing PPI networks for complex discovery
© Liu et al; licensee BioMed Central Ltd. 2011
- Published: 14 October 2011
Protein complexes are important for understanding principles of cellular organization and functions. With the availability of large amounts of high-throughput protein-protein interactions (PPI), many algorithms have been proposed to discover protein complexes from PPI networks. However, existing algorithms generally do not take into consideration the fact that not all the interactions in a PPI network take place at the same time. As a result, predicted complexes often contain many spuriously included proteins, precluding them from matching true complexes.
We propose two methods to tackle this problem: (1) The localization GO term decomposition method: We utilize cellular component Gene Ontology (GO) terms to decompose PPI networks into several smaller networks such that the proteins in each decomposed network are annotated with the same cellular component GO term. (2) The hub removal method: This method is based on the observation that hub proteins are more likely to fuse clusters that correspond to different complexes. To avoid this, we remove hub proteins from PPI networks, and then apply a complex discovery algorithm on the remaining PPI network. The removed hub proteins are added back to the generated clusters afterwards. We tested the two methods on the yeast PPI network downloaded from BioGRID. Our results show that these methods can improve the performance of several complex discovery algorithms significantly. Further improvement in performance is achieved when we apply them in tandem.
The performance of complex discovery algorithms is hindered by the fact that not all the interactions in a PPI network take place at the same time. We tackle this problem by using localization GO terms or hubs to decompose a PPI network before complex discovery, which achieves considerable improvement.
- Gene Ontology
- Decomposition Method
- Maximal Clique
- Reference Complex
- Complex Discovery
High-throughput experimental techniques have produced large amounts of protein interactions, which makes it possible to discover protein complexes from protein-protein interaction (PPI) networks. A PPI network can be modeled as an undirected graph, where vertices represent proteins and edges represent interactions between proteins. Protein complexes are groups of proteins that interact with one another, so they are usually dense subgraphs in PPI networks. Many algorithms have been developed to discover complexes from PPI networks [1–8].
As a model organism, Saccharomyces cerevisiae (baker’s yeast) has been extensively studied, and its PPI network is now relatively complete. However, the performance of existing complex discovery algorithms on the yeast PPI network is not very satisfactory. One reason is that a protein does not necessarily participate in all its known interactions simultaneously. With very few exceptions, existing complex discovery algorithms do not take this into consideration. As a result, the clusters generated often contain extra proteins that preclude them from matching true complexes. An ideal solution would be to decompose the PPI network into several smaller networks such that the interactions within each smaller network are contextually coherent. In practice, it is very difficult to know which subsets of interactions take place together. Here we choose to use cellular component GO terms to decompose PPI networks because a protein complex can be formed only if its proteins are localized within the same compartment of the cell. We use only localization GO terms that are relatively general for decomposition.
The existence of hub proteins is another factor that makes it difficult for complex discovery algorithms to decide the boundary of clusters. Hub proteins are proteins that have many neighbors in the PPI network, and these neighbors often belong to multiple complexes. A hub may therefore fuse clusters that correspond to different complexes. To avoid this, we remove hub proteins from PPI networks prior to clustering. After the clusters are generated from the remaining PPI network, we add the removed hub proteins back to the clusters.
We tested the above methods on the yeast PPI network downloaded from BioGRID. The results show that these methods can improve the performance of existing complex discovery algorithms significantly. A preliminary version of this paper was presented as a short paper at BIBM 2010. In this version, we have included more experimental results and further discussed why some complexes are so hard to detect. In the rest of the paper, we first describe the two methods for decomposing PPI networks, and then present the experimental results.
In this section, we first describe the two methods for decomposing PPI networks for complex discovery, and then briefly introduce the complex discovery algorithms used in our experiments.
The localization GO term decomposition method
A protein complex can only be formed if its proteins are localized within the same compartment of the cell. Hence we use cellular component GO terms to decompose a given PPI network into several smaller PPI networks such that all proteins in each smaller network are annotated with the same localization GO term. We use only localization GO terms that are relatively general for decomposition. There are several reasons for this. First, it is relatively easy to obtain the rough localization of proteins, compared with obtaining the precise and specific localization of proteins. Secondly, very specific GO terms are annotated to very few proteins. Using them to decompose PPI networks produces many small fragments, and lots of information may be lost due to the decomposition. Finally, some very specific cellular component GO terms correspond to complexes, and they are just as hard to decide as complexes.
We use a threshold N_GO to select GO terms for decomposition, where N_GO should be large. The selected GO terms are annotated to at least N_GO proteins, and none of their descendant terms is annotated to at least N_GO proteins. If a GO term is selected, then none of its ancestor terms or descendant terms will be selected.
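The selection rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `select_go_terms` and `descendants`, the toy term hierarchy, and the protein labels are all hypothetical, and the annotation sets are assumed to be already propagated from each term to its ancestors.

```python
def descendants(term, children):
    """Return all descendant terms of `term` in the GO hierarchy."""
    seen = set()
    stack = list(children.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, ()))
    return seen


def select_go_terms(children, annotations, n_go):
    """Select terms annotated to >= n_go proteins whose descendants all fall below n_go."""
    def is_large(t):
        return len(annotations.get(t, ())) >= n_go

    selected = set()
    for term in annotations:
        if is_large(term) and not any(is_large(d) for d in descendants(term, children)):
            selected.add(term)
    return selected


# Hypothetical toy hierarchy: term -> set of child terms.
children = {"cell": {"nucleus", "cytoplasm"}, "nucleus": {"nucleolus"}}
# Hypothetical annotations, assumed already propagated to ancestor terms.
annotations = {
    "cell": {"p1", "p2", "p3", "p4", "p5", "p6"},
    "nucleus": {"p1", "p2", "p3"},
    "nucleolus": {"p1"},
    "cytoplasm": {"p4", "p5", "p6"},
}

print(select_go_terms(children, annotations, n_go=3))
print(select_go_terms(children, annotations, n_go=4))
```

With a threshold of 3 the two mid-level terms are chosen; raising it to 4 leaves only the root term, which illustrates why a larger N_GO yields more general terms.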
Given a selected GO term, we first remove all the proteins that are not annotated to the term from the given PPI network, and then apply a complex discovery algorithm on the resultant network. This process is repeated for every selected GO term. The final set of clusters is the union of the clusters discovered from every filtered network. Duplicated clusters are removed.
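A minimal sketch of this filter, cluster, and union loop is given below. The function names are hypothetical, and `connected_components` is only a stand-in for a real complex discovery algorithm such as MCL or CMC; the toy network and term names are invented for illustration.

```python
def build_adjacency(edges):
    """Build an undirected graph as a dict: protein -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def induced_subgraph(adj, keep):
    """Keep only proteins in `keep` and the edges among them."""
    return {u: adj[u] & keep for u in adj if u in keep}


def connected_components(adj):
    """Stand-in clustering: connected components with at least two proteins."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        if len(comp) >= 2:
            clusters.append(comp)
    return clusters


def decompose_and_cluster(adj, term_to_proteins, cluster_fn):
    """Cluster each GO-filtered network and return the union of the clusters."""
    clusters = set()
    for proteins in term_to_proteins.values():
        sub = induced_subgraph(adj, proteins)
        for c in cluster_fn(sub):
            clusters.add(frozenset(c))  # a set of frozensets drops duplicates
    return clusters


# Hypothetical toy network: a path A-B-C-D-E.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])
terms = {
    "nucleus": {"A", "B", "C"},
    "cytoplasm": {"C", "D", "E"},
    "duplicate_of_nucleus": {"A", "B", "C"},
}
result = decompose_and_cluster(adj, terms, connected_components)
print(sorted(sorted(c) for c in result))
```

On the undecomposed toy network the stand-in algorithm would return a single five-protein cluster; after decomposition it returns two three-protein clusters, and the duplicate produced by the third term is dropped.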
The hub removal method
Hub proteins are those proteins that have many neighbors in the PPI network. We use a threshold N_hub to define hub proteins. We call a protein a hub protein if it has at least N_hub neighbors. A hub protein often connects proteins that belong to different complexes, which makes it hard to decide the boundary of the complexes and the membership of the hub proteins.
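The hub definition translates directly into a degree filter. The sketch below uses hypothetical function names and a toy network in which one protein, H, bridges two otherwise separate triangles.

```python
def find_hubs(adj, n_hub):
    """Return proteins that have at least n_hub neighbors."""
    return {u for u, nbrs in adj.items() if len(nbrs) >= n_hub}


def remove_proteins(adj, removed):
    """Return a copy of the network without the removed proteins and their edges."""
    return {u: nbrs - removed for u, nbrs in adj.items() if u not in removed}


# Toy network: triangles A-B-C and D-E-F, both fully connected to H.
adj = {
    "A": {"B", "C", "H"}, "B": {"A", "C", "H"}, "C": {"A", "B", "H"},
    "D": {"E", "F", "H"}, "E": {"D", "F", "H"}, "F": {"D", "E", "H"},
    "H": {"A", "B", "C", "D", "E", "F"},
}

hubs = find_hubs(adj, n_hub=5)
remaining = remove_proteins(adj, hubs)
print(hubs)            # H is the only protein with at least 5 neighbors
print(remaining["A"])  # A keeps only its neighbors inside the first triangle
```

Once H is removed, the two triangles are no longer connected, so a clustering algorithm has no path along which to fuse them.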
Here Connectivity(u, C) denotes the connectivity between a hub protein u and a cluster C, and w(u,v) is the weight of edge (u,v), calculated from the original PPI network using iterative AdjustCD before removing hubs. If there is no edge between u and v, then w(u,v) = 0. A hub protein u is added to a cluster C only if Connectivity(u, C) ≥ hub_add_thres, where hub_add_thres is a number between 0 and 1.
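The add-back step might look like the sketch below. Note the assumption: the formula for Connectivity(u, C) is not given in the text here, so the code takes it to be the average edge weight between u and the members of C, which stays between 0 and 1 when the weights do. The edge weights are invented for illustration and are not computed with AdjustCD.

```python
def connectivity(u, cluster, weights):
    """ASSUMED definition: average weight of the edges between u and the cluster members."""
    if not cluster:
        return 0.0
    total = sum(weights.get(frozenset((u, v)), 0.0) for v in cluster)  # missing edge -> 0
    return total / len(cluster)


def add_hubs_back(clusters, hubs, weights, hub_add_thres=0.3):
    """Add each hub to every cluster whose connectivity to it reaches the threshold."""
    result = []
    for c in clusters:
        extra = {h for h in hubs if connectivity(h, c, weights) >= hub_add_thres}
        result.append(set(c) | extra)
    return result


# Hypothetical edge weights between hub H and the proteins of two clusters.
weights = {
    frozenset(("H", "A")): 0.6,
    frozenset(("H", "B")): 0.6,
    frozenset(("H", "C")): 0.3,
    frozenset(("H", "D")): 0.3,
}
clusters = [{"A", "B", "C"}, {"D", "E", "F"}]

print(add_hubs_back(clusters, {"H"}, weights, hub_add_thres=0.3))
```

Because a hub is tested against every cluster independently, it can join several clusters, which lets the output overlap even when the underlying algorithm produces a partition.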
Combining the two methods
1. Let 𝒞 be the set of clusters generated. Initially, 𝒞 is empty.
2. Remove hub proteins that have at least N_hub neighbors from the given PPI network G. Let G′ be the resultant network.
3. Let g_1, …, g_m be the localization GO terms that are selected using threshold N_GO. For each g_i, do the following:
   (a) Remove proteins that are not annotated with g_i from G′. Let G′_i be the resultant network.
   (b) Apply a complex discovery algorithm on G′_i to find clusters. Let 𝒞_i be the set of clusters generated, and add the clusters in 𝒞_i to 𝒞.
4. Remove duplicated clusters from 𝒞.
5. Add hub proteins back to the clusters in 𝒞.
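The combined procedure can be sketched end to end as below. Several things are assumptions rather than the authors' code: the function names, the toy data, the use of connected components as a stand-in complex discovery algorithm, and the connectivity score, which is taken to be the average edge weight between a hub and a cluster. The GO terms passed in are assumed to have been selected with N_GO beforehand.

```python
def connected_components(adj):
    """Stand-in for a complex discovery algorithm: components of size >= 2."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        if len(comp) >= 2:
            clusters.append(comp)
    return clusters


def combined_pipeline(adj, weights, go_terms, n_hub, cluster_fn, hub_add_thres=0.3):
    # Remove hub proteins (at least n_hub neighbors) to obtain the hub-free network.
    hubs = {u for u, nbrs in adj.items() if len(nbrs) >= n_hub}
    g_prime = {u: nbrs - hubs for u, nbrs in adj.items() if u not in hubs}

    # For each selected GO term, filter the hub-free network and cluster it.
    clusters = set()  # a set of frozensets, so duplicated clusters collapse
    for proteins in go_terms.values():
        g_i = {u: nbrs & proteins for u, nbrs in g_prime.items() if u in proteins}
        clusters.update(frozenset(c) for c in cluster_fn(g_i))

    # Add hubs back using the ASSUMED average-weight connectivity score.
    final = set()
    for c in clusters:
        extra = {
            h for h in hubs
            if sum(weights.get(frozenset((h, v)), 0.0) for v in c) / len(c) >= hub_add_thres
        }
        final.add(c | extra)
    return final


# Toy network: triangles A-B-C and D-E-F, both fully connected to hub H.
adj = {
    "A": {"B", "C", "H"}, "B": {"A", "C", "H"}, "C": {"A", "B", "H"},
    "D": {"E", "F", "H"}, "E": {"D", "F", "H"}, "F": {"D", "E", "H"},
    "H": {"A", "B", "C", "D", "E", "F"},
}
# Hypothetical weights: H is strongly tied to the first triangle only.
weights = {frozenset(("H", p)): 1.0 for p in "ABC"}
weights.update({frozenset(("H", p)): 0.2 for p in "DEF"})
go_terms = {
    "nucleus": {"A", "B", "C", "H"},
    "cytoplasm": {"D", "E", "F", "H"},
    "duplicate_of_nucleus": {"A", "B", "C"},
}

result = combined_pipeline(adj, weights, go_terms, n_hub=5, cluster_fn=connected_components)
print(sorted(sorted(c) for c in result))
```

In this toy run the hub rejoins only the cluster it is strongly connected to, and the cluster found twice under two different terms appears once in the output.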
Complex discovery algorithms
We used the following complex discovery algorithms in our study. MCL and RNSC generate a partition of the PPI network, and they do not allow overlap among clusters. The other two algorithms, IPCA and CMC, allow overlap among clusters.
Markov Cluster Algorithm (MCL) is motivated by a heuristic formulated in terms of stochastic flow. It iteratively enhances the contrast between regions of strong and weak flow in the graph. The process converges towards a partition of the graph, with a set of high-flow regions (the clusters) separated by boundaries with no flow. The performance of MCL is mainly affected by the “-I inflation” option, which controls the granularity of the output clustering.
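To make the flow heuristic concrete, here is a small pure-Python sketch of the expansion and inflation loop. It is an illustrative simplification and not the MCL program used in the experiments: the function names, the convergence tolerance, the pruning threshold, and the way clusters are read off the final matrix are assumptions of this sketch. The `inflation` argument plays the role described for the “-I” option.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(adjacency)
    # Self-loops are a common MCL convention; they also keep every column non-empty.
    m = normalize_columns([[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: flow spreads along longer walks
        inflated = normalize_columns([[x ** inflation for x in row]
                                      for row in expanded])  # inflation: strong flow is boosted
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row with non-negligible flow lists the nodes attracted to that row's node.
    # If the loop stops before convergence, these sets may still overlap.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


def adjacency_from_edges(n, edges):
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1
    return a


triangle_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
separate = adjacency_from_edges(6, triangle_edges)             # two disconnected triangles
bridged = adjacency_from_edges(6, triangle_edges + [(2, 3)])   # same, joined by one edge
print(mcl(separate))
print(mcl(bridged, inflation=2.0))
```

Higher inflation values penalize weak flow more strongly and tend to give finer clusterings, which matches the granularity role the text ascribes to the inflation option.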
Restricted Neighborhood Search Clustering (RNSC) is a cost-based local search algorithm that explores the solution space to minimize a cost function, calculated according to the number of intra-cluster and inter-cluster edges. RNSC searches for a low-cost clustering by first composing an initial random clustering, and then iteratively moving a node from one cluster to another in a randomized fashion to reduce the clustering’s cost. It also makes diversification moves to avoid local minima. RNSC performs several runs, and reports the clustering from the best run. The number of runs is controlled by the “-e” option.
IPCA follows the general approach of expanding clusters from seed vertices. It first assigns weights to edges and vertices, and then picks the vertex with the highest weight as the seed of a new cluster. Other vertices are then added to the cluster based on their connectivity. For each subsequent cluster, the vertex with the highest weight among those vertices that do not appear in previous clusters is chosen as the seed, and the cluster is expanded using all the vertices except the seed vertices of the previous clusters. Whether a vertex can be added to a cluster is determined by the diameter of the resultant cluster (the “-P” option) and the connectivity between the vertex and the cluster (the “-T” option).
Clustering by Maximal Cliques (CMC) first generates all the maximal cliques from a given PPI network, and then removes or merges highly overlapping cliques based on their inter-connectivity as follows. Each maximal clique is assigned a score based on its weighted density and size. If the overlap between two maximal cliques exceeds a threshold overlap_thres, then CMC checks whether the inter-connectivity between the two cliques exceeds a threshold merge_thres. If it does, then the two cliques are merged together; otherwise, the clique with the lower score is removed.
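The first step of CMC, enumerating maximal cliques, can be illustrated with the classic Bron-Kerbosch algorithm with pivoting. Whether CMC itself uses this particular enumeration method is not stated in the text, so treat the choice as an assumption; the scoring, overlap, and merge steps are deliberately left out of this sketch.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph (Bron-Kerbosch with pivoting).

    `adj` maps each vertex to the set of its neighbors and must not contain self-loops.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most candidate neighbors to prune branches.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy network: triangle A-B-C with a pendant edge C-D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(sorted(sorted(c) for c in maximal_cliques(adj)))
```

The two cliques found here share protein C, which is the kind of overlap that the overlap_thres and merge_thres checks described above are meant to resolve.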
In this section, we first describe the datasets and the evaluation method used in our experiments, and then study the impact of the two decomposition methods on the performance of the four complex discovery algorithms.
We used the yeast PPI dataset downloaded from BioGRID (version 3.0.64) in our experiments. We kept only physical interactions that were generated by the following experiment types: Affinity Capture-Luminescence, Affinity Capture-MS, Affinity Capture-RNA, Affinity Capture-Western, Biochemical Activity, Co-crystal Structure, Co-fractionation, Co-localization, Co-purification, Far Western, FRET, PCA, Protein-peptide, Protein-RNA, Reconstituted Complex, Two-hybrid. Self-interactions were removed. The resulting dataset contains 5765 proteins and 52096 binary interactions.
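The filtering step can be sketched as below. The input format is an assumption: records are plain (protein A, protein B, experiment type) tuples, not the actual BioGRID file layout, and the protein labels are placeholders. Collapsing repeated observations of the same pair into one undirected edge is also an assumption about what "binary interactions" means here.

```python
# Experiment types kept in the study, as listed in the text.
ALLOWED_EXPERIMENT_TYPES = {
    "Affinity Capture-Luminescence", "Affinity Capture-MS", "Affinity Capture-RNA",
    "Affinity Capture-Western", "Biochemical Activity", "Co-crystal Structure",
    "Co-fractionation", "Co-localization", "Co-purification", "Far Western",
    "FRET", "PCA", "Protein-peptide", "Protein-RNA", "Reconstituted Complex",
    "Two-hybrid",
}


def filter_interactions(records):
    """Keep allowed experiment types, drop self-interactions, merge repeated pairs."""
    edges = set()
    for protein_a, protein_b, experiment_type in records:
        if experiment_type in ALLOWED_EXPERIMENT_TYPES and protein_a != protein_b:
            edges.add(frozenset((protein_a, protein_b)))  # undirected, deduplicated
    return edges


# Hypothetical records with placeholder protein labels.
records = [
    ("P1", "P2", "Two-hybrid"),
    ("P2", "P1", "Affinity Capture-MS"),   # same pair, observed by another experiment
    ("P3", "P3", "Two-hybrid"),            # self-interaction, dropped
    ("P1", "P4", "Synthetic Lethality"),   # not in the allowed list, dropped
    ("P2", "P3", "Co-purification"),
]

edges = filter_interactions(records)
proteins = set().union(*edges)
print(len(proteins), "proteins,", len(edges), "binary interactions")
```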
Statistics of reference complexes
Parameter settings of the four complex discovery algorithms
Table: Parameter settings of complex discovery algorithms (recoverable entry: -e10 -D50 -d10 -t20 -T3)
Results of the GO term decomposition method
Table: Number of GO terms selected under different N_GO values (row: #GO terms selected)
Results of the hub removal method
Table: Number of hub proteins and PPIs removed under different N_hub values (row: #hub proteins removed)
We use parameter hub_add_thres to determine when a hub can be added to a cluster. In our experiments, we found that the proper range for hub_add_thres is [0.2, 0.9]. In the rest of the experiments, we set hub_add_thres to 0.3.
It has been proposed that two types of hubs exist: party hubs that interact with all their neighbours simultaneously, and date hubs that interact with different neighbours at different times. We postulate that when N_hub ≥ 30, most of the hubs removed correspond to date hubs, as it is physically unlikely for a protein to bind to so many other proteins at the same time due to its limited surface area. However, when removing hubs with fewer neighbours, it might be helpful to identify and remove only date hubs, while ignoring party hubs. To test this hypothesis, we removed only hubs that are part of at least 3, 5, or 7 reference complexes, for N_hub = 5-9, 10-14, or 15-19. This experiment assumes that we have a classifier that can accurately distinguish between date hubs (hubs that belong to many reference complexes) and party hubs (hubs that belong to fewer complexes). However, none of these settings showed any significant improvement over not removing these hubs with fewer neighbours, possibly because too few hubs were removed to have a significant impact on performance.
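The selective removal tested here can be sketched as a filter on degree range and reference-complex membership. The function name, the toy network, and the toy complexes are hypothetical; the sketch only shows how candidate date hubs would be picked, not the full experiment.

```python
def select_date_hubs(adj, reference_complexes, degree_range, min_complexes):
    """Return hubs whose degree lies in degree_range and that belong to
    at least min_complexes reference complexes (used as a proxy for date hubs)."""
    low, high = degree_range
    selected = set()
    for protein, neighbors in adj.items():
        if low <= len(neighbors) <= high:
            membership = sum(1 for c in reference_complexes if protein in c)
            if membership >= min_complexes:
                selected.add(protein)
    return selected


# Toy network: H1 and H2 each have five neighbors; all other proteins have one.
adj = {"H1": {"a", "b", "c", "d", "e"}, "H2": {"v", "w", "x", "y", "z"}}
for hub in ("H1", "H2"):
    for neighbor in adj[hub]:
        adj[neighbor] = {hub}

# Hypothetical reference complexes: H1 belongs to three of them, H2 to one.
reference_complexes = [
    {"H1", "a", "b"},
    {"H1", "c"},
    {"H1", "d", "e"},
    {"H2", "v", "w"},
]

print(select_date_hubs(adj, reference_complexes, degree_range=(5, 9), min_complexes=3))
```

Only H1 is selected in this example, since H2 has enough neighbors but belongs to a single complex and is therefore treated as a party hub.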
Results of combining the two methods
F1-measure of the four algorithms when match_thres=0.5
In this paper, we proposed two methods to decompose PPI networks for complex discovery. We used four complex discovery algorithms to experimentally study the effectiveness of the two methods. The results show that the two decomposition methods help improve the performance of the four algorithms significantly. The two partitioning clustering algorithms, MCL and RNSC, benefit more from the GO decomposition method, while the two algorithms that allow overlap among clusters, CMC and IPCA, benefit from both.
For the GO term decomposition method, we recommend using localization GO terms that are relatively general because their annotations are easier to obtain and they also preserve more information than GO terms that are very specific.
There are two main reasons why some complexes cannot be detected. First, there might be too few interactions existing between proteins in the complex. Secondly, the complex itself might be densely connected, but so is the region surrounding it, which makes it difficult to correctly delineate the boundary around it. Both cases are difficult to handle. We may need to use other information besides PPI data to detect such complexes.
This work was supported in part by a Singapore National Research Foundation grant NRF-G-CRP-2007-04-082(d) (Wong, Liu) and by a National University of Singapore NGS scholarship (Yong).
This article has been published as part of Proteome Science Volume 9 Supplement 1, 2011: Proceedings of the International Workshop on Computational Proteomics. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/9/S1.
- van Dongen S: Graph clustering by flow simulation. PhD thesis, University of Utrecht 2000.
- Przulj N, Wigle D: Functional topology in a network of protein interactions. Bioinformatics 2003, 20(3):340–348.
- Bader G, Hogue C: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4(2).
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 2006, 7(207).
- Adamcsek B, Palla G, Farkas I, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 2006, 22(8):1021–1023. doi:10.1093/bioinformatics/btl039
- Chua H, Ning K, Sung W, Leong H, Wong L: Using indirect protein-protein interactions for protein complex prediction. Journal of Bioinformatics and Computational Biology 2008, 6(3):435–466. doi:10.1142/S0219720008003497
- Li M, Chen J, Wang J, Hu B, Chen G: Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics 2008, 9(398).
- Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics 2009, 25(15):1891–1897. doi:10.1093/bioinformatics/btp311
- Habibi M, Eslahchi C, Wong L: Protein Complex Prediction based on k-Connected Subgraphs in Protein Interaction Network. BMC Systems Biology 2010, 4:129. doi:10.1186/1752-0509-4-129
- Han JDJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 2004, 430:88–93. doi:10.1038/nature02555
- Stark C, Reguly BJBT, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research 2006, 34(Database issue):535–539.
- Liu G, Yong CH, Chua HN, Wong L: Decomposing PPI networks for complex discovery. BIBM 2010, 280–283.
- King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20(17):3013–3020. doi:10.1093/bioinformatics/bth351
- Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Research 2004, 32(Database issue):41–44.
- Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Research 2009, 37(3):825–831. doi:10.1093/nar/gkn1005
- Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-Based Assembly of Protein Complexes in Yeast. Science 2004, 303(5666):2026–2029. doi:10.1126/science.1092645
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25:25–29. doi:10.1038/75556
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.