Accuracy improvement in protein complex prediction from protein interaction networks by refining cluster overlaps

Background Recent computational techniques have facilitated analyzing genome-wide protein-protein interaction data for several model organisms. Various graph-clustering algorithms have been applied to protein interaction networks on the genomic scale for predicting the entire set of potential protein complexes. In particular, the density-based clustering algorithms which are able to generate overlapping clusters, i.e. the clusters sharing a set of nodes, are well-suited to protein complex detection because each protein could be a member of multiple complexes. However, their accuracy is still limited because of complex overlap patterns of their output clusters. Results We present a systematic approach of refining the overlapping clusters identified from protein interaction networks. We have designed novel metrics to assess cluster overlaps: overlap coverage and overlapping consistency. We then propose an overlap refinement algorithm. It takes as input the clusters produced by existing density-based graph-clustering methods and generates a set of refined clusters by parameterizing the metrics. To evaluate protein complex prediction accuracy, we used the f-measure by comparing each refined cluster to known protein complexes. The experimental results with the yeast protein-protein interaction data sets from BioGRID and DIP demonstrate that accuracy on protein complex prediction has increased significantly after refining cluster overlaps. Conclusions The effectiveness of the proposed cluster overlap refinement approach for protein complex detection has been validated in this study. Analyzing overlaps of the clusters from protein interaction networks is a crucial task for understanding of functional roles of proteins and topological characteristics of the functional systems.


Background
Protein-protein interaction data are a crucial resource in understanding the underlying mechanisms of biological processes. In recent years, high-throughput experimental techniques have made remarkable advances in identifying protein-protein interactions on the scale of the entire genome, collectively referred to as the interactome. The rich amount of protein-protein interaction data sets have been integrated and mapped into a protein interaction network [1][2][3]. This network is represented as an undirected and un-weighted graph where proteins are nodes and interactions are edges.
Over the past few years, systematic analysis of protein interaction networks by theoretical and empirical studies has been in the spotlight in bioinformatics. It has been observed that the genome-scale interaction networks of several model organisms are typically modular [4]. Consequently, a wide range of graph clustering algorithms [5] have been applied to the interaction networks to predict potential protein complexes, the sets of proteins closely binding each other to perform specific cellular functions.
Previous graph clustering algorithms can be categorized into density-based approaches, hierarchical approaches and partition-based approaches. Densitybased approaches detect densely connected subgraphs in protein interaction networks. A typical example in this category is the maximal clique algorithm to detect fully connected subgraphs [6]. Because of the strict constraints of the maximum-size cliques, relatively dense subgraphs are identified by using a density threshold or incorporating the percolation of small-size cliques. Because of computational inefficiency of finding cliques, a number of heuristic seed-growth style algorithms have been presented. They select seeds as initial points and expand them using alternative density functions. Typical examples include MCODE [7], DPClus [8], IPCA [9] and the entropy-based algorithm [10]. The details of these algorithms are discussed in the Method section.
The hierarchical approaches have been frequently applied to genomic or proteomic data because the hierarchical nature of clusters is significant to understand the global structure of functional organizations. Bottomup hierarchical approaches start with each node as a separate cluster and then iteratively merge the two closest clusters. Top-down hierarchical approaches start with the whole graph as a single cluster and then recursively divide the cluster into smaller clusters. The iterative merging approaches should precisely measure distance or similarity between two clusters by estimating strength of interconnections or statistical significance of common members [11,12]. For the recursive division, finding exact cutting point for each iteration is a challenging issue. The edge-betweenness method [13] is an example to detect the hierarchy by identifying a bridge between two potential clusters repeatedly using the betweenness measure. The betweenness of an edge is calculated by the fraction of the shortest paths passing through the edge.
Partition-based approaches explore the best partition of a network, including the periphery. The Restricted Neighborhood Search Clustering (RNSC) [14] is a costbased local search algorithm to find an optimal partition. The process begins with a random or user-specified partition. Each vertex on the border of a cluster is then moved to an adjacent cluster in a random manner such that cost is minimized. The cost function captures the ratio of invalid links between clusters to valid links within clusters. Markov Clustering (MCL) [15] is a fast and scalable partition-based algorithm by flow simulation. This algorithm simulates random walks within a Markov matrix that is mapped to the input graph. It repeatedly alternates between two operators, expansion and inflation, to update the matrix. This process continues until there is no further change in the matrix, terminating with the best partition of the graph.
Although these previous graph clustering algorithms are qualified to detect protein complexes from protein interaction networks, their accuracy is still limited. One of the challenges is overlapping cluster generation. The clustering algorithms should be able to assign each node to multiple clusters because a protein could have different interacting partners at different times and places. However, because the partition-based or hierarchical clustering algorithms always produce disjoint sets, only density-based methods are suitable for detecting overlapping clusters. A previous study [16] has presented a general model of overlapping sub-network structures. This model was validated by the intra-connection rate of each overlapping cluster.
We also note that the overlapping clusters sharing a set of proteins often represent the same protein complex. The examples in the protein interaction network of S. cerevisiae are shown in Table 1. For each cluster generated by the entropy-based approach, we computed f-measure by comparing to known protein complexes and chose the protein complex with the highest f-score. We observed in this test it occurs frequently that two or more clusters, in particular overlapping clusters, have the highest f-score to the same protein complex. In Table 1, the Prp19-associated complex is composed of eight proteins. The entropy-based method produced three overlapping clusters which have the best matches for the complex in f-measure. Four common proteins (YDR416W, YGR129W, YLL036C, YLR117C) over all three clusters are the members of the complex. The other four proteins in the complex also appear in one of the three clusters. How could we then infer a protein complex with higher accuracy from the overlapping clusters? The next three examples in Table 1 show the case that we can detect a protein complex with higher accuracy when we take the union set of two overlapping clusters. In contrast, the last example in table 1 is the case that the intersection set of two overlapping clusters matches a protein complex more precisely.
In this article, we present a novel systematic approach to refine overlapping clusters and re-generate a new set of clusters from protein interaction networks. The aim of this study is to increase accuracy of protein complex prediction by refining the overlaps. First, we implement five density-based graph-clustering methods to obtain a set of preliminary overlapping clusters. We next introduce a unique strategy to refine the preliminary clusters by applying novel metrics: overlap coverage and overlapping consistency. We propose an overlap refinement algorithm which yields a final set of clusters by parameterizing the metrics. The experimental results with the protein-protein interaction data sets of S. cerevisiae downloaded from BioGRID [17] and DIP [18] show that the proposed approach achieves a statistically significant improvement on accuracy of protein complex prediction.

Previous density-based clustering methods
Density-based graph-clustering algorithms search densely connected subgraphs in protein interaction networks. We discuss four commonly-used methods in this category: CFinder, MCODE, DPClus and the entropybased algorithm.

CFinder
Palla et al. [19] introduced a process of k-clique percolation along with the associated definitions of k-clique adjacency and k-clique chain. Two k-cliques are adjacent if they share (k − 1) nodes where k is the number of nodes in each clique. A k-clique chain is the union of a sequence of adjacent k-cliques. A k-clique percolation cluster is then a maximal k-clique chain. CFinder [20] searches all k-clique percolation clusters in an undirected graph with a parameter k. Larger k values correspond to a higher stringency during the identification of dense subgraphs and provide smaller groups with a higher density of links inside them. [7] is a typical seed-growth style clustering algorithm. It weights each node v by the core-clustering coefficient of v, which is defined as the density of the highest k-core of the directly connected neighbors of v together with v itself. Compared to the general clustering coefficient [21], the core clustering coefficient amplifies the weights of heavily interconnected regions while deleting many less-connected nodes. The k-core of a graph is a maximal subgraph such that each node in the subgraph has at least k links [22]. The algorithm then seeds a cluster with the highest weighted node and recursively includes a neighboring node if its weight is above a threshold.

DPClus
DPClus [8] is also a seed-growth algorithm to find local dense regions based on connectivity. It weights each node by sum of the edge weights to its neighboring nodes, while each edge is weighted by the number of common neighbors between two ending nodes. The node with the highest weight is selected as a seed which becomes a single-node cluster. The cluster grows gradually by adding repeatedly its neighboring nodes if it reaches a density threshold for either the core or the periphery. IPCA [9] has the same process to DPClus on weighting nodes and selecting a seed. However, on the step of extending the seed cluster, a neighboring node is added if it has a higher ratio of links to the cluster than an interaction probability threshold and if the diameter of the cluster is less than a maximum diameter threshold.

Entropy-based algorithm
The entropy-based approach [10] has been recently introduced as a seed-growth algorithm. It repeatedly finds a locally optimal cluster with minimal graph entropy. A high-level description of the algorithm is given below: 1. Select a random seed node, and form a seed cluster including the selected seed and its neighbors. 2. Remove nodes in the cluster iteratively to decrease graph entropy until it is minimal. 3. Add neighboring nodes of the cluster iteratively to decrease graph entropy until it is minimal. 4. Output the cluster, and repeat the steps (1), (2) and (3) until no seed candidate remain.
As a weakness, this algorithm might fall into the local minimum too quickly. To avoid this problem, we We applied the entropy-based approach to the yeast protein interaction network. For each cluster, we computed f-scores by comparing to experimentally determined protein complexes and chose one complex with the highest f-score. It shows five examples that two or more clusters have the highest f-score to the same protein complex.
propose a slight variation of the entropy-based clustering algorithm. (It will be called the modified entropybased method.) A high-level description of the algorithm is given below: 1. Select a clique of size 3 as an initial cluster.
2. Add all neighboring nodes of the cluster.
3. Remove nodes added on the step (2) iteratively to decrease graph entropy until it is minimal. 4. Repeat the steps (2) and (3) until the step (3) removes all nodes added on the step (2). 5. Output the cluster, and repeat the steps from (1) to (4) until no seed candidate remain.
This modification allows the clusters to keep growing in the case where the addition of a neighboring node will temporarily increase entropy, but the addition of that node along with certain additional neighboring nodes will ultimately decrease entropy. For example, if there exists a set of densely connected neighboring nodes of a cluster, the original algorithm will only consider each node independently. However, the modified algorithm will consider the set as a whole.

Cluster overlap analysis
In this section, we introduce novel metrics to define properties of cluster overlaps. Suppose we have a set of n clusters. An overlap is a non-empty intersection of two clusters. Then, the overlaps of a cluster c i can be defined as a non-unique collection of sets of vertices in c i , each of which is an overlap of c i with another cluster. When V (c i ) denotes the set of all vertices in c i , where The cluster c i may have overlaps with several other clusters, and each overlap may have the different number of vertices. The average overlap size of a cluster c i is then formulated as where |o| is the size of the overlap o.

Overlap rate
The overlap rate of a cluster c i is defined as the average overlap size of c i , divided by the total number of vertices in c i .
This formula indicates the fraction of the vertices in c i involved in the average overlap. Higher the overlap rate of c i is, more vertices in c i appear in any other clusters on average.

Overlap coverage
The overlap coverage of a cluster c i represents the ratio of the number of vertices in c i which appear in one or more overlaps of c i .
This formula can be used to measure how unique the cluster c i is. Higher overlap coverage of c i indicates that a larger portion of the vertices in ci are also included into other clusters. For instance, if all vertices in c i are shared with other clusters, then c i has the maximum overlap coverage which is 1.

Overlapping consistency
The overlapping consistency of a cluster c i measures the uniformity of the overlaps of c i . It is calculated as the overlap rate divided by the overlap coverage.
The overlapping consistency ranges between 0 and 1, inclusive, because the values for R overlap (c i ) are upperbounded by the values of Cov(c i ). For instance, if a vertex in c i also belongs to several different clusters and the other vertices in c i do not belong to any other clusters, then c i has the maximum overlapping consistency because the overlap rate and overlap coverage are the same. If both of the overlapping consistency and the overlap coverage are high, this could indicate the overlapping clusters represent highly related groups.

Cluster overlap refinement
We propose an algorithm for refining the preliminary overlapping clusters using the novel metrics defined above. This method creates a new cluster from the preliminary clusters that have significant overlaps by parameterizing the metrics. This cluster overlap refinement algorithm is described in Table 2.
The algorithm takes as input a set of preliminary clusters, S. It requires three parameters as thresholds: the minimum overlap coverage minCov, the minimum overlapping consistency minCons, and the minimum consensus constraint minCss. In Line 2 of the algorithm, the minCov and minCons become the minimum boundaries of overlap coverage and consistency for each cluster to be refined. Line 15 enforces the consensus constraint to merge clusters only if they are strongly related. This constraint changes the overlap optimizing process. If this minimum consensus constraint minCss was 100%, then the result would be the intersection of the overlapping clusters. If it was 0%, the result would be the union of them. This constraint can thus be chosen flexibly between the intersection and the union to select only significant vertices from overlapping clusters. The proper selection of the minimum consensus value prevents a set of clusters from being generated by the two extreme cases of the union, which is too generous, and the intersection, which is too strict.

Clustering accuracy measure
For clustering accuracy evaluation, we compare each cluster to real protein complexes using the f-measure as a combination of precision and recall. Suppose we compare a cluster c with a protein complex p i . Recall, also called a true positive rate or sensitivity, is the ratio of common members of c and p i to the number of proteins in p i .
Precision, also called a positive predictive value, is the ratio of common proteins of c and p i to the number of proteins in c.
The f-score is then the harmonic mean of recall and precision.
This f-score makes a direct comparison between an output cluster and a gold-standard protein complex without any bias towards the cluster size. For each output cluster, we search for the best match from the list of gold-standard protein complexes in regard to f-scores. The accuracy of clustering algorithms is then measured by the average f-score of the best matches over all output clusters.

Data source
We explored the application of our approach to proteinprotein interaction data of S. cerevisiae. The genomewide yeast protein-protein interaction data are publicly available in several open databases such as BioGRID [17], IntAct [23], MINT [24], MIPS [25], STRING [26] and DIP [18]. In this experiment, we used two proteinprotein interaction data sets. First, we downloaded the core protein-protein interaction data of S. cerevisiae from DIP, which includes 2526 distinct proteins and 5949 interactions between them. The core interactions Add g into S' have been selected from the full data set by curative processes based on protein sequences and RNA expression profiles [27]. We thus expect that most of the interactions in this data set are reliable. However, we have to consider a number of false negatives, i.e. true interactions which do not appear in this data set. Next, we tested with the exceptionally large protein-protein interaction data set of S. cerevisiae from BioGRID, which includes 5590 distinct proteins and 92906 interactions. This data set has been accumulated from highthroughput experimental results published. It is therefore likely to contain a significant number of false positives, i.e. spurious interactions which do not occur in vivo.
To evaluate clustering accuracy of the proposed approach, we acquired the protein complex data recently determined [28]. As gold-standard, we combined both data sets: CYC2008 which has 408 manually curated heteromeric protein complexes derived from small-scale experiments and YHTP2008 which comprises 400 putative complexes collected mostly from high-throughput experimental results.

Protein complex detection from DIP data Clustering by existing methods
To predict potential protein complexes from DIP protein-protein interaction data, we tested five densitybased graph clustering approaches: CFinder, MCODE, DPClus, the entropy-based method, and the modified entropy-based method. Their clustering results are shown in Table 3. The entropy-based method produced a large number of small-sized clusters including many singletons, i.e. clusters containing only a single protein, whereas the modified entropy-based method generated a small number of large-sized clusters. The output clusters of CFinder has the highest average overlap rate which is close to 0.2. The overlap rate of a cluster indicates the proportion of overlaps in the cluster on average. The average overlap rate of 0.2 thus implies that 20% of the nodes in each cluster are involved in overlapping on average. Interestingly, MCODE was not able to yield any overlapping clusters although it searches densely connected sub-graphs. For each method, the distribution of occurrences of any protein over all output clusters is plotted in a log scale in Figure 1. We counted how many times each protein occurs in different clusters. As a general pattern of output clusters, these plots describe the exponential decrease of the number of proteins with respect to the number of occurrences. In this experiment, a slightly different trend from the average overlap rate on CFinder was perceptible. CFinder generated less proteins occurring in multiple clusters than DPClus and the modified entropy-based method. This result implies that CFinder produces the clusters with higher overlapping consistency because of a higher overlap rate, but lower overlap coverage because of less proteins involved in overlapping, than the other methods.
To evaluate accuracy of each method, we measured the average f-score of output clusters comparing to gold-standard protein complexes. As shown in Table 3, the clusters generated by CFinder have the highest average f-score. However, as a drawback, CFinder requires the longest runtime in the large-size complex network among all the methods tested. The clusters generated by the entropy-based method have the lowest average fscore because most of them are extremely small-sized. However, the modification of this method has markedly improved its accuracy by yielding relatively large clusters, and achieved a slightly higher level of accuracy than MCODE and DPClus.

Improvement by cluster overlap refinement
We implemented the cluster overlap refinement approach to assess improvement on protein complex detection. We used as input the set of clusters produced by three clustering algorithms: CFinder, DPClus and the modified entropy-based method. We were not able to test MCODE because the clusters did not have any overlaps. We also dropped testing the original entropybased method because the average overlapping rate is close to 0. Instead of the entropy-based method, we used the modified entropy-based method for this experiment. The optimal refinement of cluster overlaps was performed by changing the values of three parameters: the minimum overlap coverage threshold (minCov), the minimum overlapping consistency threshold (minCons) and the minimum consensus constraint (minCss). It collected all overlapping clusters which have the overlap coverage and the overlapping consistency greater than their minimum thresholds, and then re-generated a new set of clusters by selecting the optimal value of minCss. We tested five density-based graph-clustering approaches on the yeast protein-protein interaction data from DIP. Their clustering results and the average overlap rates are shown. The overlap rate of a cluster indicates the proportion of overlaps in the cluster on average. As accuracy of each method on predicting protein complexes, we measured the average f-score of the clusters comparing to protein complexes experimentally determined.
First, we used varying parameter values of minCov and minCons between 0 and 1 to explore the effect of minCss on protein complex detection accuracy. Since the minimum consensus of 0 means taking the union of all overlapping clusters, the output clusters are gradually enlarged as minCss decreases. In contrast, the minimum consensus of 1, as taking the intersection of overlapping clusters, resulted in the lowest average cluster size. Figure 2 shows how accuracy of the clusters refined is affected by different values of minCss. Figure 2(a), (b) and 2(c) are the results when we used the preliminary clusters produced by CFinder, DPClus and the modified entropy-based method, respectively. We varied the parameter values of minCov and minCons from 0.1 to 0.6, but assigned the same value to minCov and minCons for each case. Use of low values of minCov and min-Cons means that the overlap refinement process is applied to the clusters even if they have only a small portion of overlaps. Therefore, when minCov and min-Cons are 0.1 or 0.2, we could observe a pattern such that the average f-score is very low when minCss is lower than 0.3. It is readily understood that naively merging two clusters results in low accuracy. As minCov and minCons increase, we have consistent average fscores regardless of minCss values. If two clusters have a very large overlapping region, then their union set would be similar to their overlap. In the tests of CFinder and DPClus, the average f-scores were not affected by changing minCss when minCov and minCons are 0.6. For the modified entropy-based method, we attained the consistent average f-score when assigning 0.4 to both minCov and minCons. Considering all plots in Figure 2, we chose as the optimal value of minCss 0.7 for CFinder and DPClus and 0.8 for the modified entropy-based method.
Next, we used the selected minCss values to find the optimal combination of minCov and minCons. Figure 3 shows the effect of different parameter settings for min-Cov and minCons on the accuracy of the clusters refined. Figure 3(a), (b) and 3(c) show the results from CFinder, DPClus and the modified entropy-based method, respectively. In the tests of CFinder and the modified entropy-based method, we achieved the best average f-scores of refined clusters when using the lowest values of minCov and minCons. This trend was already observed in the previous experiment for the modified entropy-based method in Figure 2(c). However, for DPClus, the best accuracy was captured in the ranges between 0.2 and 0.5 for minCov and between 0 and 0.2 for minCons, as shown in Figure 3(c).
We analyzed statistically the improvement on protein complex detection by refining cluster overlaps. Figure   Figure 1 Distribution of occurrences of any protein over all clusters. We counted how many times each protein occurs in different clusters. Each plot describes the exponential decrease of the number of proteins with respect to the number of occurrences. All four clustering algorithms have a similar trend of protein occurrences in overlapping clusters. 4 displays in box plots two distributions of f-scores of the clusters before and after overlap refinement for each of the three clustering algorithms. Figure 4(a) obviously demonstrates that the overall accuracy of the clusters produced by CFinder has been improved by refining overlaps because of more than 15% increase of the median point and more than 20% increase of the 3rd quartile (the upper quartile) point in the f-score distribution. Because the clusters which have the overlap coverage and overlapping consistency below the selected thresholds remain intact during the refinement, it is feasible that the 1st quartile (the lower quartile) or the minimum point does not alter in the distribution. As shown in Figure 4(b), the accuracy of the clusters produced by DPClus has been slightly improved by refining overlaps. However, the refinement approach has improved substantially the clusters produced by the modified entropy-based method. Figure 4(c) shows the increments of both the 1st and 3rd quartile points. These results in Figure 4 justify the effectiveness of the proposed overlap refinement approach. They also address that the extent of improvements varies depending on the clustering algorithms and their preliminary clusters.

Protein complex detection from BioGRID data
We carried out additional experiments of cluster overlap refinement with the most recent version of the protein-protein interaction data set of S. cerevisiae from BioGRID. This BioGRID interaction network is larger and significantly denser than the DIP network, 2.2 times more distinct proteins and 15 times more edges. Moreover, it has been considered that it includes a large number of false interactions which create extremely complex connectivity. It is thus expected that the accuracy of protein complex detection from BioGRID data is lower than the previous tests with DIP data.  We tested five density-based graph clustering approaches: CFinder, MCODE, DPClus, the entropybased method, and the modified entropy-based method. However, we were not able to obtain clustering results from CFinder because of its computational inefficiency on the large and dense network. Same to the previous test, MCODE did not generate any overlapping clusters. The entropy-based method produced an extremely large number of singletons and very few clusters containing more than 3 proteins. These clusters are also non-overlapping. Table 4 shows the clustering results and their accuracy. As compared to Table 3, all the methods, except the entropy-based method, generated more clusters with similar overlapping rates. However, as expected, the average f-scores of the clusters decreased remarkably on this complex network.
To select the optimal parameter values of minCov, minCons and minCss for this data set, we applied the same procedure as discussed in the previous section. We dropped testing MCODE and the entropy-based method because of the overlapping rate of 0. We also failed testing the modified entropy-based method because it generated the clusters with lower overlapping consistency, which can be verified by its lower average overlap rate, but extremely higher overlap coverage than the other methods. As a result, the overlap refinement process terminated with merging all output clusters into a single cluster. We thus used the set of clusters produced by DPClus only. First, we changed the values of minCov and minCons to find the best minCss, as shown in Figure 5. The general changing pattern of the average f-score was similar to that in Figure 2(b). As minCov and minCons increase, we have consistent average fscores regardless of minCss values. At low values of minCov and minCons, the average f-score is very low when minCss is lower than 0.1. This plot shows that the optimal minCss value should be selected in the range between 0.4 and 0.5, which is lower than the optimal value chosen in the previous section. We next used the minCss value of 0.4 to find the optimal combination of minCov and minCons. Figure 6 shows the average fscore change by different parameter settings of minCov and minCons. From this result, minCov of 0.2 and min-Cons of 0 should be chosen as the best combination. Figure 7 shows statistical analysis for the improvement on protein complex detection by overlap refinement. The 3rd quartile (the upper quartile) and the maximum points significantly increased after refining overlaps. This result also indicates that the proposed overlap refinement approach works effectively on large-size complex networks.

Conclusion
The generation of the genome-wide protein-protein interactions in model organisms is proceeding rapidly, heightening the demand for advances in the computational techniques to provide systematic mapping and analyze the protein interaction networks. Advanced computational approaches have been applied to uncover functional patterns hidden in the complex systems. In Figure 4 Statistical analysis of accuracy improvement on protein complex detection by refining cluster overlaps from DIP data. For each method, two distributions of f-scores of the clusters before and after refining overlaps are exhibited. The box plots in (a) and (c) show a substantial increase of the 1st quartile, the median and the 3rd quartile points after refining overlaps. The box plots in (b) also show a slight increment of the median and the 3rd quartile points after refining overlaps. These figures justify the effectiveness of the proposed overlap refinement approach. We tested four density-based graph-clustering approaches on the yeast protein-protein interaction data from BioGRID. Their clustering results and the average overlap rates are shown. As accuracy of each method on predicting protein complexes, we measured the average f-score of the clusters comparing to protein complexes experimentally determined. Figure 5 Accuracy change with different parameter settings for minCss on refining cluster overlaps from BioGRID data. This figure shows how the average f-score of the clusters refined is affected by different parameter values of minCss. We used as input the preliminary clusters produced by DPClus. At low values of minCov and minCons, we have a low average f-score when minCss is lower than 0.2. As minCov and minCons increase, we have consistent average f-scores regardless of minCss values. Chiam and Cho Proteome Science 2012, 10(Suppl 1):S3 http://www.proteomesci.com/content/10/S1/S3 particular, various graph-clustering algorithms have identified potential functional organizations from protein interaction networks. We have designed a novel approach of analyzing cluster overlaps systematically. Our approach refines the overlapping clusters, generated by any commonly-used density-based clustering techniques, for the purpose of increasing accuracy on protein complex prediction from protein interaction networks. Through a series of newly defined overlap formulas such as overlap coverage and overlapping consistency, the proposed overlap refinement algorithm enhances the quality of the clusters best matching to known protein complexes.
The proposed approach has been tested with two yeast protein-protein interaction data sets: BioGRID which is known as complete interactome and the core set from DIP which is a reliable subset of full data. The preliminary clusters as input have been acquired from several density-based clustering algorithms: CFinder, MCODE, DPClus and the entropy-based method. We discussed the process of finding the best parameter settings for minCov, minCons and minCss in the proposed approach. We finally demonstrated significant improvements on protein complex prediction accuracy after refining preliminary overlapping clusters. These experimental results eventually led to the conclusion that this approach works successfully for any clustering methods and any protein-protein interaction data sets by optimizing the parameter values.
Overlapping is one of the key properties of functional organizations of molecular components. Analyzing the overlaps of clusters from protein interaction networks is a critical task for not only detecting protein complexes but also complete understanding of functional roles of proteins and topological characteristics of the functional systems. This study provides a systematic framework for effective analysis of functional overlap information inherent in biological networks.