Accuracy improvement in protein complex prediction from protein interaction networks by refining cluster overlaps
© Chiam and Cho; licensee BioMed Central Ltd. 2012
Published: 21 June 2012
Volume 10 Supplement 1
Recent computational techniques have facilitated analyzing genome-wide protein-protein interaction data for several model organisms. Various graph-clustering algorithms have been applied to protein interaction networks on the genomic scale for predicting the entire set of potential protein complexes. In particular, the density-based clustering algorithms which are able to generate overlapping clusters, i.e. the clusters sharing a set of nodes, are well-suited to protein complex detection because each protein could be a member of multiple complexes. However, their accuracy is still limited because of complex overlap patterns of their output clusters.
We present a systematic approach of refining the overlapping clusters identified from protein interaction networks. We have designed novel metrics to assess cluster overlaps: overlap coverage and overlapping consistency. We then propose an overlap refinement algorithm. It takes as input the clusters produced by existing density-based graph-clustering methods and generates a set of refined clusters by parameterizing the metrics. To evaluate protein complex prediction accuracy, we used the f-measure by comparing each refined cluster to known protein complexes. The experimental results with the yeast protein-protein interaction data sets from BioGRID and DIP demonstrate that accuracy on protein complex prediction has increased significantly after refining cluster overlaps.
The effectiveness of the proposed cluster overlap refinement approach for protein complex detection has been validated in this study. Analyzing overlaps of the clusters from protein interaction networks is a crucial task for understanding of functional roles of proteins and topological characteristics of the functional systems.
Protein-protein interaction data are a crucial resource in understanding the underlying mechanisms of biological processes. In recent years, high-throughput experimental techniques have made remarkable advances in identifying protein-protein interactions on the scale of the entire genome, collectively referred to as the interactome. The wealth of protein-protein interaction data sets has been integrated and mapped into protein interaction networks [1–3]. Such a network is represented as an undirected, unweighted graph in which proteins are nodes and interactions are edges.
Over the past few years, systematic analysis of protein interaction networks by theoretical and empirical studies has been in the spotlight in bioinformatics. It has been observed that the genome-scale interaction networks of several model organisms are typically modular. Consequently, a wide range of graph clustering algorithms have been applied to the interaction networks to predict potential protein complexes, the sets of proteins closely binding each other to perform specific cellular functions.
Previous graph clustering algorithms can be categorized into density-based, hierarchical and partition-based approaches. Density-based approaches detect densely connected subgraphs in protein interaction networks. A typical example in this category is the maximal clique algorithm, which detects fully connected subgraphs. Because the clique constraint is strict, relatively dense subgraphs are instead identified by using a density threshold or by incorporating the percolation of small-size cliques. Because finding cliques is computationally inefficient, a number of heuristic seed-growth style algorithms have also been presented. They select seeds as initial points and expand them using alternative density functions. Typical examples include MCODE, DPClus, IPCA and the entropy-based algorithm. The details of these algorithms are discussed in the Method section.
The hierarchical approaches have been frequently applied to genomic or proteomic data because the hierarchical nature of clusters is significant for understanding the global structure of functional organizations. Bottom-up hierarchical approaches start with each node as a separate cluster and then iteratively merge the two closest clusters. Top-down hierarchical approaches start with the whole graph as a single cluster and then recursively divide the cluster into smaller clusters. The iterative merging approaches should precisely measure the distance or similarity between two clusters by estimating the strength of interconnections or the statistical significance of common members [11, 12]. For the recursive division, finding the exact cutting point at each iteration is a challenging issue. The edge-betweenness method is an example that detects the hierarchy by repeatedly identifying a bridge between two potential clusters using the betweenness measure. The betweenness of an edge is calculated as the fraction of the shortest paths passing through the edge.
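The betweenness measure described above can be illustrated with a minimal sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets, and it sums, over all node pairs, the fraction of shortest paths that use each edge (a Brandes-style accumulation). The function name `edge_betweenness` and the toy graph are hypothetical and are not taken from the cited method.

```python
from collections import deque


def edge_betweenness(adj):
    """Sum, over all node pairs, of the fraction of shortest paths using each edge."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: distances and shortest-path counts.
        dist, sigma, order = {s: 0}, {s: 1}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        # Walk back from the farthest nodes, pushing path fractions onto edges.
        delta = {u: 0.0 for u in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist.get(u, -1) == dist[v] - 1:  # u precedes v on a shortest path
                    share = sigma[u] / sigma[v] * (1 + delta[v])
                    bet[frozenset((u, v))] += share
                    delta[u] += share
    # Every unordered pair was visited from both endpoints.
    return {edge: value / 2 for edge, value in bet.items()}


# Toy graph: two triangles joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
scores = edge_betweenness(toy)
bridge = max(scores, key=scores.get)
```

In this toy graph the bridge between the two triangles receives the highest score, which is the edge a divisive method of this kind would cut first.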
Partition-based approaches explore the best partition of a network, including the periphery. The Restricted Neighborhood Search Clustering (RNSC) is a cost-based local search algorithm to find an optimal partition. The process begins with a random or user-specified partition. Each vertex on the border of a cluster is then moved to an adjacent cluster in a random manner such that cost is minimized. The cost function captures the ratio of invalid links between clusters to valid links within clusters. Markov Clustering (MCL) is a fast and scalable partition-based algorithm by flow simulation. This algorithm simulates random walks within a Markov matrix that is mapped to the input graph. It repeatedly alternates between two operators, expansion and inflation, to update the matrix. This process continues until there is no further change in the matrix, terminating with the best partition of the graph.
Although these previous graph clustering algorithms are qualified to detect protein complexes from protein interaction networks, their accuracy is still limited. One of the challenges is overlapping cluster generation. The clustering algorithms should be able to assign each node to multiple clusters because a protein could have different interacting partners at different times and places. However, because the partition-based or hierarchical clustering algorithms always produce disjoint sets, only density-based methods are suitable for detecting overlapping clusters. A previous study has presented a general model of overlapping sub-network structures. This model was validated by the intra-connection rate of each overlapping cluster.
Examples of overlapping clusters representing the same protein complex
YLL036C YMR213W YJR050W YLR117C YDR416W YGR129W YBR188C YPR101W
YLL036C YDR416W YMR213W YGR129W YLR117C YNR011C YDR364C
YLL036C YJR050W YDR416W YGR129W YLR117C YPL213W YIR009W
YLL036C YDR416W YBR188C YGR129W YLR117C YPR101W
YGL194C YIL112W YDR155C YOL068C YKR029C YBR103W YCR033W
YGL194C YKR029C YCR033W YIL112W
YGL194C YKR029C YBR103W
cAMP-dependent protein kinase
YIL033C YJL164C YPL203W YKL166C
YNL227C YKL166C YPL203W
NuA4 histone acetyltransferase complex
YFL039C YJL081C YPR023C YEL018W YJR082C YNL136W YFL024C YOR244W YGR002C YHR099W YDR359C YNL107W YHR090C
YNL107W YOR244W YFL024C YPR023C
YJR033C YDR202C YDR328C
YDR306C YDR202C YJL204C YGL149W YOR080W YJL149W YMR258C YBR280C YJR033C YML088W YDR131C YLR368W YLR097C YDL132W YLR352W YDR328C YLR224W
YMR054W YJR033C YDR202C YOR270C YBR127C YDL185W YHR060W
In this article, we present a novel systematic approach to refine overlapping clusters and re-generate a new set of clusters from protein interaction networks. The aim of this study is to increase accuracy of protein complex prediction by refining the overlaps. First, we implement five density-based graph-clustering methods to obtain a set of preliminary overlapping clusters. We next introduce a unique strategy to refine the preliminary clusters by applying novel metrics: overlap coverage and overlapping consistency. We propose an overlap refinement algorithm which yields a final set of clusters by parameterizing the metrics. The experimental results with the protein-protein interaction data sets of S. cerevisiae downloaded from BioGRID and DIP show that the proposed approach achieves a statistically significant improvement on accuracy of protein complex prediction.
Density-based graph-clustering algorithms search densely connected subgraphs in protein interaction networks. We discuss four commonly-used methods in this category: CFinder, MCODE, DPClus and the entropy-based algorithm.
Palla et al. introduced a process of k-clique percolation along with the associated definitions of k-clique adjacency and k-clique chain. Two k-cliques are adjacent if they share (k − 1) nodes where k is the number of nodes in each clique. A k-clique chain is the union of a sequence of adjacent k-cliques. A k-clique percolation cluster is then a maximal k-clique chain. CFinder searches all k-clique percolation clusters in an undirected graph with a parameter k. Larger k values correspond to a higher stringency during the identification of dense subgraphs and provide smaller groups with a higher density of links inside them.
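The k-clique percolation definitions above can be sketched in a few lines. This is a brute-force illustration for tiny graphs only, not the CFinder implementation: it enumerates every k-clique, links cliques that share k − 1 nodes, and returns the union of each linked group. The function name `k_clique_clusters` and the toy graph are hypothetical.

```python
from itertools import combinations


def k_clique_clusters(adj, k):
    """Return k-clique percolation clusters of a graph given as neighbor sets."""
    nodes = sorted(adj)
    # Enumerate all k-cliques by brute force (exponential; illustration only).
    cliques = [
        frozenset(group)
        for group in combinations(nodes, k)
        if all(v in adj[u] for u, v in combinations(group, 2))
    ]
    # Union-find over cliques: two k-cliques are adjacent if they share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Each maximal chain of adjacent cliques becomes one cluster.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)


# Toy graph: triangles a-b-c and b-c-d share two nodes; triangle d-e-f shares only d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"b", "c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
clusters = k_clique_clusters(toy, 3)
```

With k = 3 the toy graph yields two clusters that overlap in node d, which shows how this family of methods produces overlapping output.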
MCODE is a typical seed-growth style clustering algorithm. It weights each node v by the core-clustering coefficient of v, which is defined as the density of the highest k-core of the directly connected neighbors of v together with v itself. Compared to the general clustering coefficient, the core-clustering coefficient amplifies the weights of heavily interconnected regions while deleting many less-connected nodes. The k-core of a graph is a maximal subgraph such that each node in the subgraph has at least k links. The algorithm then seeds a cluster with the highest weighted node and recursively includes a neighboring node if its weight is above a threshold.
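The k-core and core-clustering coefficient described above can be sketched as follows. This is a minimal illustration, not the MCODE implementation: the density is assumed to be the number of edges divided by the number of possible edges without self-loops, and the function names and toy graph are hypothetical.

```python
def k_core(adj, k):
    """Return the nodes of the k-core: prune nodes with fewer than k links until stable."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive


def core_clustering_coefficient(adj, v):
    """Density of the highest k-core of v's neighborhood (v plus its direct neighbors)."""
    hood = adj[v] | {v}
    sub = {u: adj[u] & hood for u in hood}
    k, best = 0, set(hood)
    while True:
        core = k_core(sub, k + 1)
        if not core:
            break
        k, best = k + 1, core
    n = len(best)
    if n < 2:
        return 0.0
    edges = sum(len(sub[u] & best) for u in best) // 2
    # Assumed density definition: edges over possible edges, ignoring self-loops.
    return edges / (n * (n - 1) / 2)


# Toy graph: a sits in the triangle a-b-c and has one pendant neighbor d.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
weight = core_clustering_coefficient(toy, "a")
```

In the toy graph the pendant neighbor d is pruned when the 2-core is taken, so node a is scored on the triangle alone. Its ordinary clustering coefficient would be lower, since only one of the three neighbor pairs is linked.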
DPClus is also a seed-growth algorithm that finds locally dense regions based on connectivity. It weights each node by the sum of the edge weights to its neighboring nodes, while each edge is weighted by the number of common neighbors of its two end nodes. The node with the highest weight is selected as a seed, which becomes a single-node cluster. The cluster grows gradually by repeatedly adding neighboring nodes as long as a density threshold for either the core or the periphery is reached. IPCA weights nodes and selects a seed in the same way as DPClus. However, when extending the seed cluster, a neighboring node is added only if its ratio of links to the cluster is higher than an interaction probability threshold and the diameter of the cluster is less than a maximum diameter threshold.
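The weighting step shared by DPClus and IPCA can be sketched as below. It covers only the edge and node weights and the seed choice as described above. The growth step and its thresholds are omitted, and the function name `dpclus_weights` and the toy graph are hypothetical.

```python
def dpclus_weights(adj):
    """Edge weight = number of common neighbors; node weight = sum of incident edge weights."""
    edge_weight = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                edge_weight[(u, v)] = len(adj[u] & adj[v])
    node_weight = {u: 0 for u in adj}
    for (u, v), w in edge_weight.items():
        node_weight[u] += w
        node_weight[v] += w
    return edge_weight, node_weight


# Toy graph with five nodes; a has the most shared neighbors.
toy = {
    "a": {"b", "c", "d", "e"}, "b": {"a", "c", "e"}, "c": {"a", "b", "d"},
    "d": {"a", "c"}, "e": {"a", "b"},
}
edge_weight, node_weight = dpclus_weights(toy)
seed = max(node_weight, key=node_weight.get)  # highest-weight node becomes the seed
```

Here node a has the largest weight and would be chosen as the seed of the first cluster.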
The entropy-based algorithm proceeds as follows:
(1) Select a random seed node, and form a seed cluster including the selected seed and its neighbors.
(2) Remove nodes in the cluster iteratively to decrease graph entropy until it is minimal.
(3) Add neighboring nodes of the cluster iteratively to decrease graph entropy until it is minimal.
(4) Output the cluster, and repeat steps (1), (2) and (3) until no seed candidate remains.
The modified entropy-based algorithm proceeds as follows:
(1) Select a clique of size 3 as an initial cluster.
(2) Add all neighboring nodes of the cluster.
(3) Remove nodes added in step (2) iteratively to decrease graph entropy until it is minimal.
(4) Repeat steps (2) and (3) until step (3) removes all nodes added in step (2).
(5) Output the cluster, and repeat steps (1) to (4) until no seed candidate remains.
This modification allows the clusters to keep growing in the case where the addition of a neighboring node will temporarily increase entropy, but the addition of that node along with certain additional neighboring nodes will ultimately decrease entropy. For example, if there exists a set of densely connected neighboring nodes of a cluster, the original algorithm will only consider each node independently. However, the modified algorithm will consider the set as a whole.
where |o| is the size of the overlap o.
This formula indicates the fraction of the vertices in c_i involved in the average overlap. The higher the overlap rate of c_i, the more vertices in c_i appear in other clusters on average.
This formula can be used to measure how unique the cluster c_i is. A higher overlap coverage of c_i indicates that a larger portion of the vertices in c_i are also included in other clusters. For instance, if all vertices in c_i are shared with other clusters, then c_i has the maximum overlap coverage, which is 1.
The overlapping consistency ranges between 0 and 1, inclusive, because the values of R_overlap(c_i) are upper-bounded by the values of Cov(c_i). For instance, if one vertex in c_i also belongs to several different clusters and the other vertices in c_i do not belong to any other cluster, then c_i has the maximum overlapping consistency because the overlap rate and the overlap coverage are the same. If both the overlapping consistency and the overlap coverage are high, this could indicate that the overlapping clusters represent highly related groups.
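The three metrics can be illustrated with a short sketch. The definitions used here are an assumption reconstructed from the descriptions above, not formulas quoted from the article: the overlap rate is taken as the average overlap size divided by the cluster size, the overlap coverage as the fraction of vertices shared with at least one other cluster, and the overlapping consistency as the ratio of the two. The function name `overlap_metrics` is hypothetical.

```python
def overlap_metrics(cluster, others):
    """Return (overlap rate, overlap coverage, overlapping consistency) for one cluster.

    Assumed definitions (a reading of the prose, not quoted formulas):
      rate        = mean size of the non-empty overlaps / |cluster|
      coverage    = |union of the overlaps| / |cluster|
      consistency = rate / coverage
    """
    cluster = set(cluster)
    overlaps = [cluster & set(other) for other in others]
    overlaps = [o for o in overlaps if o]  # keep only clusters that actually overlap
    if not overlaps:
        return 0.0, 0.0, 0.0
    rate = sum(len(o) for o in overlaps) / len(overlaps) / len(cluster)
    coverage = len(set().union(*overlaps)) / len(cluster)
    return rate, coverage, rate / coverage


# One vertex shared with three other clusters: rate equals coverage, consistency is 1.
single_hub = overlap_metrics({"a", "b", "c", "d"}, [{"a", "x"}, {"a", "y"}, {"a", "z"}])

# Every vertex shared, but with two different clusters: coverage is 1, consistency is 0.5.
split = overlap_metrics({"a", "b", "c", "d"}, [{"a", "b", "x"}, {"c", "d", "y"}])
```

Under these assumed definitions the first case reproduces the example in the text, where a single shared vertex gives the maximum consistency. The second case has full coverage but lower consistency, because the cluster overlaps with two unrelated groups.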
The cluster overlap refinement algorithm
OverlapOptimization (S, minCov, minCons, minCss)
1 for each g ∈ S
2   if Cov(g) < minCov or Cons(g) < minCons
3     Add g into S'
4   end if
5   else
6     Assign all nodes a value of 0
7     Increment the value of each node in g
8     count ← 1
9     Find overlapping clusters with g
10    for each overlapping cluster c
11      g ← g ∪ c
12      Increment the value of each node in c
13      count ← count + 1
14    end for
15    Remove from g any node with a value less than (count × minCss)
16    if g is not redundant
17      Add g into S'
18    end if
19  end else
20 end for
21 return S'
The algorithm takes as input a set of preliminary clusters, S. It requires three parameters as thresholds: the minimum overlap coverage minCov, the minimum overlapping consistency minCons, and the minimum consensus constraint minCss. In Line 2 of the algorithm, the minCov and minCons become the minimum boundaries of overlap coverage and consistency for each cluster to be refined. Line 15 enforces the consensus constraint to merge clusters only if they are strongly related. This constraint changes the overlap optimizing process. If this minimum consensus constraint minCss was 100%, then the result would be the intersection of the overlapping clusters. If it was 0%, the result would be the union of them. This constraint can thus be chosen flexibly between the intersection and the union to select only significant vertices from overlapping clusters. The proper selection of the minimum consensus value prevents a set of clusters from being generated by the two extreme cases of the union, which is too generous, and the intersection, which is too strict.
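The refinement procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the coverage and consistency metrics use definitions reconstructed from the prose, the vote threshold is the number of merged clusters multiplied by minCss, and "not redundant" is assumed to mean that an identical cluster is not already in the output. All function names are hypothetical.

```python
from collections import Counter


def coverage_and_consistency(cluster, others):
    """Assumed metrics: coverage = shared fraction, consistency = overlap rate / coverage."""
    overlaps = [cluster & o for o in others if cluster & o]
    if not overlaps:
        return 0.0, 0.0
    rate = sum(len(o) for o in overlaps) / len(overlaps) / len(cluster)
    coverage = len(frozenset().union(*overlaps)) / len(cluster)
    return coverage, rate / coverage


def refine_overlaps(clusters, min_cov, min_cons, min_css):
    """Merge strongly overlapping clusters, keeping nodes with enough votes."""
    clusters = [frozenset(c) for c in clusters]
    refined = []
    for i, g in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        cov, cons = coverage_and_consistency(g, others)
        if cov < min_cov or cons < min_cons:
            refined.append(g)  # below the thresholds: keep the cluster unchanged
            continue
        votes = Counter(g)  # each node of g starts with one vote
        count = 1
        for c in others:
            if g & c:  # c overlaps with g
                votes.update(c)
                count += 1
        # Consensus constraint: drop nodes with fewer than count * min_css votes.
        merged = frozenset(v for v, n in votes.items() if n >= count * min_css)
        if merged and merged not in refined:  # assumed meaning of "not redundant"
            refined.append(merged)
    return [set(c) for c in refined]


toy = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"a", "b", "f"}, {"x", "y"}]
half = refine_overlaps(toy, 0.5, 0.5, 0.5)          # majority vote
strict = refine_overlaps(toy, 0.5, 0.5, 1.0)        # intersection of the merged clusters
generous = refine_overlaps(toy, 0.5, 0.5, 0.0)      # union of the merged clusters
```

On the toy input the three overlapping clusters collapse into one refined cluster, while the isolated cluster passes through unchanged. Setting the consensus value to 1.0 or 0.0 reproduces the intersection and union extremes described above.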
This f-score makes a direct comparison between an output cluster and a gold-standard protein complex without any bias towards the cluster size. For each output cluster, we search for the best match from the list of gold-standard protein complexes in regard to f-scores. The accuracy of clustering algorithms is then measured by the average f-score of the best matches over all output clusters.
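The evaluation can be sketched as below, assuming the standard f-measure: precision is the shared fraction of the output cluster, recall is the shared fraction of the gold-standard complex, and the f-score is their harmonic mean. The function names and the toy data are hypothetical.

```python
def f_score(cluster, complex_members):
    """Harmonic mean of precision and recall between a cluster and a known complex."""
    cluster, complex_members = set(cluster), set(complex_members)
    common = len(cluster & complex_members)
    if common == 0:
        return 0.0
    precision = common / len(cluster)
    recall = common / len(complex_members)
    return 2 * precision * recall / (precision + recall)


def average_best_f_score(clusters, gold_complexes):
    """Average, over all output clusters, of the best f-score against any gold complex."""
    best = [max(f_score(c, k) for k in gold_complexes) for c in clusters]
    return sum(best) / len(best)


gold = [{"a", "b", "c", "d"}, {"x", "y"}]
predicted = [{"a", "b", "c"}, {"x", "y"}]
accuracy = average_best_f_score(predicted, gold)
```

Because every output cluster contributes its best match, a method that emits many small spurious clusters is penalized even if a few of its clusters match known complexes well.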
We explored the application of our approach to protein-protein interaction data of S. cerevisiae. The genome-wide yeast protein-protein interaction data are publicly available in several open databases such as BioGRID, IntAct, MINT, MIPS, STRING and DIP. In this experiment, we used two protein-protein interaction data sets. First, we downloaded the core protein-protein interaction data of S. cerevisiae from DIP, which includes 2526 distinct proteins and 5949 interactions between them. The core interactions have been selected from the full data set by curation processes based on protein sequences and RNA expression profiles. We thus expect that most of the interactions in this data set are reliable. However, we have to consider a number of false negatives, i.e. true interactions which do not appear in this data set. Next, we tested with the exceptionally large protein-protein interaction data set of S. cerevisiae from BioGRID, which includes 5590 distinct proteins and 92906 interactions. This data set has been accumulated from published high-throughput experimental results. It is therefore likely to contain a significant number of false positives, i.e. spurious interactions which do not occur in vivo.
To evaluate the clustering accuracy of the proposed approach, we acquired recently determined protein complex data. As the gold standard, we combined two data sets: CYC2008, which has 408 manually curated heteromeric protein complexes derived from small-scale experiments, and YHTP2008, which comprises 400 putative complexes collected mostly from high-throughput experimental results.
Table: Clustering results of five density-based approaches and their accuracy on DIP data (reported columns include the number of clusters and the average overlap rate).
To evaluate the accuracy of each method, we measured the average f-score of the output clusters compared with the gold-standard protein complexes. As shown in Table 3, the clusters generated by CFinder have the highest average f-score. However, as a drawback, CFinder requires the longest runtime of all the methods tested on large, complex networks. The clusters generated by the entropy-based method have the lowest average f-score because most of them are extremely small. However, the modification of this method has markedly improved its accuracy by yielding relatively large clusters, and achieved a slightly higher level of accuracy than MCODE and DPClus.
We implemented the cluster overlap refinement approach to assess the improvement in protein complex detection. We used as input the sets of clusters produced by three clustering algorithms: CFinder, DPClus and the modified entropy-based method. We were not able to test MCODE because its clusters did not have any overlaps. We also excluded the original entropy-based method because its average overlap rate is close to 0, and used the modified entropy-based method in its place. The refinement of cluster overlaps was tuned by changing the values of three parameters: the minimum overlap coverage threshold (minCov), the minimum overlapping consistency threshold (minCons) and the minimum consensus constraint (minCss). The algorithm collected all overlapping clusters whose overlap coverage and overlapping consistency were greater than their minimum thresholds, and then re-generated a new set of clusters by selecting the optimal value of minCss.
We carried out additional experiments of cluster overlap refinement with the most recent version of the protein-protein interaction data set of S. cerevisiae from BioGRID. This BioGRID interaction network is larger and significantly denser than the DIP network, with 2.2 times more distinct proteins and 15 times more edges. Moreover, it is considered to include a large number of false interactions, which create extremely complex connectivity. It is thus expected that the accuracy of protein complex detection from BioGRID data is lower than in the previous tests with DIP data.
Table: Clustering results of four density-based approaches and their accuracy on BioGRID data (reported columns include the number of clusters and the average overlap rate).
The generation of the genome-wide protein-protein interactions in model organisms is proceeding rapidly, heightening the demand for advances in the computational techniques to provide systematic mapping and analyze the protein interaction networks. Advanced computational approaches have been applied to uncover functional patterns hidden in the complex systems. In particular, various graph-clustering algorithms have identified potential functional organizations from protein interaction networks.
We have designed a novel approach of analyzing cluster overlaps systematically. Our approach refines the overlapping clusters, generated by any commonly-used density-based clustering techniques, for the purpose of increasing accuracy on protein complex prediction from protein interaction networks. Through a series of newly defined overlap formulas such as overlap coverage and overlapping consistency, the proposed overlap refinement algorithm enhances the quality of the clusters best matching to known protein complexes.
The proposed approach has been tested with two yeast protein-protein interaction data sets: BioGRID, which is regarded as the complete interactome, and the core set from DIP, which is a reliable subset of the full data. The preliminary clusters used as input have been acquired from several density-based clustering algorithms: CFinder, MCODE, DPClus and the entropy-based method. We discussed the process of finding the best parameter settings for minCov, minCons and minCss in the proposed approach. We finally demonstrated significant improvements in protein complex prediction accuracy after refining the preliminary overlapping clusters. These experimental results led to the conclusion that this approach works successfully for any clustering method and any protein-protein interaction data set by optimizing the parameter values.
Overlapping is one of the key properties of functional organizations of molecular components. Analyzing the overlaps of clusters from protein interaction networks is a critical task for not only detecting protein complexes but also complete understanding of functional roles of proteins and topological characteristics of the functional systems. This study provides a systematic framework for effective analysis of functional overlap information inherent in biological networks.
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.