Integrating domain similarity to improve protein complexes identification in TAP-MS data
Proteome Science volume 11, Article number: S2 (2013)
Abstract
Background
Detecting protein complexes in protein-protein interaction (PPI) networks plays an important role in improving our understanding of the dynamics of cellular organisation. However, protein interaction data generated by high-throughput experiments such as yeast-two-hybrid (Y2H) and tandem affinity-purification/mass-spectrometry (TAP-MS) are characterised by the presence of a significant number of false positives and false negatives. In recent years there has been a growing trend to incorporate diverse domain knowledge to support large-scale analysis of PPI networks.
Methods
This paper presents a new algorithm that incorporates Gene Ontology (GO) based semantic similarities to detect protein complexes from PPI networks generated by TAP-MS. By taking co-complex relations in TAP-MS data into account, TAP-MS PPI networks are modelled as a bipartite graph, where bait proteins form one set of nodes and prey proteins form the other. Similarities between pairs of bait proteins are computed by considering both topological features and GO-driven semantic similarities. Bait proteins are then grouped into clusters based on their pairwise similarities to produce a set of 'seed' clusters. An expansion process is applied to each 'seed' cluster to recruit prey proteins which are significantly associated with the same set of bait proteins. In this way, complete protein complexes are obtained.
Results
The proposed algorithm has been applied to real TAP-MS PPI networks. Fifteen quality measures have been employed to evaluate the quality of the generated protein complexes. Experimental results show that the proposed algorithm greatly improves the accuracy of identifying complexes and outperforms several state-of-the-art clustering algorithms. Moreover, by incorporating semantic similarity, the proposed algorithm is more robust to noise in the networks.
Background
Protein complexes, in which multiple proteins physically interact with each other, are essential to the organization and functions of cellular machines [1, 2]. With the advance of experimental and computational technologies, an immense number of protein-protein interactions (PPIs) have been detected [3–8], which can be represented in the form of networks. Thus, the accurate identification of protein complexes from such large-scale networks of PPIs has become a challenge.
Yeast-two-hybrid (Y2H) and tandem affinity-purification/mass-spectrometry (TAP-MS) are two types of high-throughput experimental techniques which have been widely applied to detect PPIs. Y2H identifies physical pairwise PPIs [3, 4], while TAP-MS detects co-complex relations by purifying proteins (called prey) that are associated with tagged proteins (called bait) [5, 6, 8].
A network of PPIs is generally represented as an undirected simple graph where proteins correspond to nodes and pairwise interactions correspond to edges. Graph-based clustering algorithms are an effective approach to identifying protein complexes. In 2000, the Markov Clustering Algorithm (MCL) [9] was proposed for identifying complexes from protein interaction networks by simulating random walks on the graph. During the clustering process, an inflation parameter is applied to enhance the contrast between regions of dense and sparse connections in the graph. The process converges towards a partition of the graph into a set of high-density subgraphs. In 2003, Bader and Hogue [10] represented PPI networks using their proposed 'Spoke' and 'Matrix' models, and applied the Molecular Complex Detection (MCODE) algorithm to detect protein complexes from the two models. MCODE identifies sets of highly connected nodes, based on the density of the neighbourhoods of nodes in the network. In 2006, Brohée and van Helden [11] evaluated the performance of four clustering algorithms in detecting protein complexes, including MCL and MCODE. The evaluation showed that, compared to the other algorithms, MCL remained robust when noise was added to the graph. In 2006, CFinder [12] was proposed to detect overlapping clusters. It explores clusters composed of adjacent k-cliques, where two adjacent k-cliques share k-1 nodes. Later, a random walk based clustering algorithm, Repeated Random Walks (RRW) [13], was proposed to identify overlapping protein complexes in PPI networks, and experimental results demonstrated that RRW obtained clusters with higher precision than MCL [13]. A novel core-attachment based algorithm, COACH, was proposed in 2009 [14]. COACH detects protein complexes with a highly dense structure and explores the "core-attachment" organization inside protein complexes. Experimental results [14] showed that COACH achieved better performance than several existing clustering algorithms.
The algorithms introduced above treat PPIs from TAP-MS data as binary. In recent years, several researchers have taken advantage of the non-binary nature of TAP-MS data, namely the co-complex relations between bait proteins and prey proteins, to identify protein complexes. In 2005, Scholtens et al. [15] modelled TAP-MS data as a directed graph where edges link from bait proteins to prey proteins, and then applied the Local Modelling algorithm [15] to this directed network to search for dense subnetworks in which all pairs of proteins should be connected. Results showed that the complexes predicted by the Local Modelling algorithm mapped well to curated protein complexes. Another example of detecting protein complexes by building a non-binary model of TAP-MS data is CODEC [16], proposed in 2011. CODEC constructs a bipartite graph to represent TAP-MS data, where one set consists only of bait proteins and the other consists of prey proteins. Edges only link nodes in the two opposite sets. CODEC identifies dense bipartite subgraphs. Experimental results [16] showed that CODEC outperformed other algorithms, with higher precision. In 2012, a new bipartite graph based clustering algorithm (BGCA) was developed to identify protein complexes from TAP-MS PPI networks [17]. Experimental results demonstrated that BGCA achieved a significant improvement in identifying protein complexes from TAP-MS data. Greater precision and better accuracy were achieved, and the identified complexes were shown to match well with existing curated protein complexes.
The algorithms introduced above have been developed based on topological features of PPI networks. However, due to experimental limitations, PPI data contain false positives and false negatives. Besides physical pairwise interactions between proteins, semantic similarity describes another type of relationship between pairs of proteins: it measures the closeness of two proteins based on estimates of ontology-based functional similarity [18, 19]. The Gene Ontology (GO) [20] is the main focus of investigation of semantic similarity in molecular biology [18]. Many measures [19, 21–23] for computing semantic similarities have been proposed by using annotations from the three GO hierarchies [20]: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Assessments of semantic similarity have confirmed that GO-driven similarity among genes is a relevant indicator of functional interaction [18]. Results in [24] also demonstrated that there is a significant correlation between the semantic similarity of pairs of proteins and their co-complex membership. It has been shown that semantic similarity assists in validating results obtained from biomedical studies, such as gene clustering and gene expression data analysis [19]. Therefore, in this paper, it is assumed that incorporating semantic similarity into the clustering process can improve the accuracy of identifying protein complexes.
Cai et al. [17] demonstrated the good performance of BGCA in detecting protein complexes in TAP-MS PPI networks. BGCA identifies protein complexes by relying on the topological similarity between pairs of bait proteins, which is calculated from the number of commonly shared prey proteins. This paper proposes a new algorithm, extended from BGCA, to detect protein complexes from TAP-MS data by integrating semantic similarity. The similarity between pairs of bait proteins is obtained by combining topology-based similarity and GO-driven semantic similarity. An agglomerative hierarchical clustering approach is applied to group bait proteins into clusters such that proteins in the same cluster are more similar to each other than to proteins in different clusters. Thus, a set of 'seed' clusters composed of bait proteins is produced. Starting from these 'seed' clusters, a greedy expansion process is developed to recruit prey proteins which are significantly associated with the same set of bait proteins. After expanding each seed cluster, a final set of protein complexes is output. Experimental results demonstrate that integrating semantic similarity improves not only the accuracy of detecting protein complexes but also the robustness of the algorithm. This paper is an extension of the conference paper [25]. Compared with [25], this paper employs more statistical measures to evaluate the quality of the clustering results of the proposed method. Moreover, the statistical significance of the clustering results is examined by estimating the random expectation of correct grouping from randomised sets of predicted complexes, and the robustness of the proposed algorithm is also investigated.
The paper is organized as follows. We first introduce the methodology of our proposed algorithm, followed by the presentation and discussion of experimental results. The proposed algorithm is applied to two real-world TAP-MS PPI networks. Several statistical metrics are employed to assess the quality of clustering. The statistical significance of the clustering results and the robustness of the proposed algorithm to false negatives and false positives are also evaluated. Finally, the conclusions and future work are presented.
Methods
Our proposed algorithm is developed from BGCA, which was proposed to detect protein complexes by modelling TAP-MS PPI networks as a bipartite graph [17]. The algorithm rests on the assumption that, as a TAP-MS experiment directly detects complex membership by purifying prey proteins which are co-associated with tagged bait proteins [5, 6], a protein complex is intuitively composed of a set of bait proteins along with a set of prey proteins that are significantly associated with the same set of bait proteins. Therefore, the core idea of the proposed algorithm is first to detect seed clusters composed of bait proteins and then to expand greedily from these seed clusters to obtain the final clusters. We obtain 'seed' clusters by grouping bait proteins based on their similarities. In this paper, we combine GO-based semantic similarity with topology-based similarity. The proposed algorithm follows the same process as BGCA [17]; the difference lies in the calculation of pairwise similarities of bait proteins, since the proposed algorithm uses the combined similarities to obtain seed clusters.
The pairwise topological similarity among bait proteins is computed based on the number of commonly shared neighbours [17], which is generalized from the notion of Jaccard Similarity Coefficient [26].
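As a rough illustration of neighbour-based similarity, the sketch below computes the plain Jaccard coefficient of the prey sets of two baits. It is only a minimal sketch: the generalised form used by BGCA [17] may differ, and the function name, the toy adjacency map and the handling of baits without preys are assumptions made for this example.

```python
def topological_similarity(neighbours_b1, neighbours_b2):
    """Plain Jaccard coefficient of two baits' neighbour (prey) sets."""
    n1, n2 = set(neighbours_b1), set(neighbours_b2)
    union = n1 | n2
    if not union:
        # Assumption: two baits without any prey are treated as dissimilar.
        return 0.0
    return len(n1 & n2) / len(union)


# Hypothetical bait -> prey adjacency of a tiny bipartite TAP-MS graph.
bait_to_preys = {
    "B1": {"P1", "P2", "P3"},
    "B2": {"P2", "P3", "P4"},
    "B3": {"P5"},
}

sim_12 = topological_similarity(bait_to_preys["B1"], bait_to_preys["B2"])
sim_13 = topological_similarity(bait_to_preys["B1"], bait_to_preys["B3"])
```

Here B1 and B2 share two of four distinct preys, while B1 and B3 share none.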
a) Semantic similarity
The GO has three ontologies [20]: MF, BP and CC. MF refers to information on what a gene product does. BP is related to a biological objective to which a gene product contributes. CC refers to the cellular location of the gene product, including cellular structures and complexes. The reader can refer to [20] for more details. In this paper, we use BP semantic similarity as the first instance.
The basic idea for calculating the similarity between gene products is to calculate similarities between all terms that are used to annotate them. Let ${b}_{1}$ and ${b}_{2}$ be two baits, and let $N\left({b}_{1}\right)$ and $N\left({b}_{2}\right)$ denote the sets of neighbours of ${b}_{1}$ and ${b}_{2}$, respectively. The semantic similarity, $\mathrm{s\_sim}\left({b}_{1},{b}_{2}\right)$, takes one of two kinds of numeric value, that is

$$\mathrm{s\_sim}\left({b}_{1},{b}_{2}\right)=\begin{cases}\mathit{simValue}, & \text{if both } {b}_{1} \text{ and } {b}_{2} \text{ have GO BP annotations}\\ -1, & \text{otherwise}\end{cases}$$

Here, simValue falls within [0,1] and represents the closeness between a pair of proteins based on information derived from GO BP annotations. The value of -1 indicates that at least one of the two proteins has no annotations. IEA ("Inferred from electronic annotation") annotations were excluded from the calculation due to their lack of reliability.
b) Combination of two similarities
The topology-based similarity and the semantic similarity are combined to generate a new pairwise similarity measure for bait proteins. As a first trial, the two similarities were combined in a simple way, by taking the arithmetic average of the topology-based similarity and the semantic similarity.
A network composed of the pairwise similarities between bait proteins is obtained accordingly.
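The averaging step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the text does not say how a pair without a usable semantic similarity is handled, so the fallback to the topological value, the use of `None` to mark such pairs, and all names are hypothetical.

```python
def combined_similarity(t_sim, s_sim):
    """Arithmetic average of topological and semantic similarity.

    s_sim is None when no semantic similarity is available for the pair
    (a hypothetical representation chosen for this sketch).
    """
    if s_sim is None:
        # Assumption: fall back to topology alone; the paper does not
        # specify the behaviour for unannotated proteins.
        return t_sim
    return (t_sim + s_sim) / 2.0


# Hypothetical pairwise scores for three bait pairs.
topological = {("B1", "B2"): 0.5, ("B1", "B3"): 0.0, ("B2", "B3"): 0.25}
semantic = {("B1", "B2"): 0.25, ("B1", "B3"): 0.5, ("B2", "B3"): None}

similarity_network = {
    pair: combined_similarity(topological[pair], semantic[pair])
    for pair in topological
}
```

The resulting dictionary plays the role of the weighted bait-bait similarity network on which seed clusters are later formed.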
Some of the clusters obtained from the expansion process overlap. A merging process is applied to obtain the final set of clusters [25]. This paper is an extension of the conference paper [25], and details of the BGCA algorithm can be found in [17].
Results
Preparation of data
In this study, two TAP-MS PPI networks are used. One is the dataset published by Gavin et al. [6], with 1993 bait proteins, 2671 prey proteins and 19157 bait-prey relationships; the other is the dataset published by Krogan et al. [8], which contains 2233 bait proteins, 5219 prey proteins and 40623 bait-prey relationships. There were 94 prey proteins which were suspected to be non-specific contaminants [8], so they were excluded from the raw dataset used in Krogan et al. For convenience, the two datasets are referred to as Gavin_2006 and Krogan_2006 in this paper.
Two gold-standard datasets are employed in our experiments. One is obtained from the Munich Information Center for Protein Sequences (MIPS) [27], and the other is the set of hand-curated complexes derived from the Wodak lab CYC2008 catalogue [28]. The MIPS data file used is dated 18 May 2006 [27]. The MIPS category 550 was removed since it was defined by computerised algorithms only and contains no curated protein complexes [27]. As a result, the MIPS gold-standard data contains 220 curated complexes. The CYC2008 catalogue includes 408 protein complexes.
Evaluation strategy
In order to avoid biases in evaluating the performance of the proposed method, the evaluation strategy is carefully designed. The evaluation process consists of the following steps:

1) A preprocess is applied to the gold-standard data and to the set of predicted clusters. A similar preprocess was also adopted in several studies [16, 29]. For benchmark complexes in the gold-standard data, known complexes none of whose proteins are included in the network are removed. For the set of candidate clusters, the clusters which have no overlap with any benchmark complex are removed.

2) Multiple quality measures are employed: precision/recall/F-Measure [29], sensitivity/Positive Predictive Value (PPV)/geometric accuracy [11], cluster-wise homogeneity/complex-wise homogeneity/geometric homogeneity [11], BH-Sensitivity/BH-Specificity/BH-F-Measure [10], and Jaccard F-Measure [29]. These quality measures calculate the degree of agreement between the clusters generated by clustering algorithms and well-studied protein complexes in a gold-standard set. The descriptions of these quality measures are provided in the section on quality measures.

3) Several typical clustering algorithms are compared with the proposed algorithm, including MCL [9, 30], MCODE [10], CFinder [12], RRW [13], COACH [14], and CODEC [16]. For each algorithm, the clustering result to be evaluated was obtained with its optimal set of parameters.

4) The statistical significance of the clustering results generated by the proposed algorithm is evaluated by computing quality scores of sets of randomly permuted complexes.

5) The robustness of the proposed algorithm to false positives and false negatives is evaluated by applying it to randomly altered networks.
Preprocessing of gold-standard datasets
The gold-standard datasets adopted in the study are MIPS [27] and CYC2008 [28]. As introduced in the evaluation strategy, the gold-standard datasets are preprocessed before being used in the evaluation. For each PPI network, proteins in each gold standard that are not contained in the corresponding network are removed, and the resulting singleton complexes are then excluded as well. Table 1 presents the number of proteins, the number of complexes and the average size of complexes in the original gold-standard datasets as well as in the preprocessed gold-standard datasets used in the experiments.
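The preprocessing described above can be sketched in a few lines. The function names and the toy data are assumptions made for illustration; complexes and clusters are represented as Python sets of protein identifiers.

```python
def preprocess_gold_standard(complexes, network_proteins):
    """Restrict each benchmark complex to proteins present in the network
    and drop complexes that become empty or singletons."""
    kept = []
    for members in complexes:
        present = set(members) & set(network_proteins)
        if len(present) > 1:
            kept.append(present)
    return kept


def filter_clusters(clusters, complexes):
    """Drop predicted clusters that share no protein with any benchmark complex."""
    benchmark_proteins = set().union(*complexes)
    return [set(c) for c in clusters if set(c) & benchmark_proteins]


# Hypothetical toy data.
network_proteins = {"A", "B", "C", "D"}
gold = [{"A", "B", "E"}, {"C", "F"}, {"G", "H"}]
predicted = [{"A", "D"}, {"D"}, {"C", "D"}]

gold_clean = preprocess_gold_standard(gold, network_proteins)
predicted_clean = filter_clusters(predicted, gold_clean)
```

In the toy data only the first complex survives (as {A, B}), and only the first predicted cluster overlaps it.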
Selection of parameters
We select the parameters following a trial-and-error procedure. Unless indicated otherwise, the results reported in this paper were derived with the following parameter settings: the hierarchical clustering was implemented with unweighted average linkage, and the cut-off values were set to 0.3 and 0.25 for the Gavin_2006 and Krogan_2006 networks, respectively. The overlapping rate is set to 0.2.
In the experiments, the inflation of MCL is set to 3.0 for the Gavin_2006 network and 2.0 for the Krogan_2006 network, since these settings gave better results than other inflation values. For MCODE, on Gavin_2006, the depth is set to 100, the node score percentage to 0.2, Haircut to TRUE, Fluff to FALSE and the percentage for complex fluffing to 0.2; on Krogan_2006, the node score percentage is set to 0.1, and the other parameters remain the same as those applied to the Gavin_2006 network. For CFinder, the results generated with $k=5$ are used, since they are better than those for other values of k according to the quality measures. RRW has three parameters: restart probability, early cut-off and overlapping rate. Their values are 0.6, 0.6 and 0.2 for Gavin_2006 and 0.5, 0.7 and 0.2 for Krogan_2006, respectively. CODEC has two schemes, CODEC-w0 and CODEC-w1, and we compare our algorithm to both. For COACH, we only use the final predicted clusters, without considering its predicted core clusters.
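To make the role of the linkage and the cut-off concrete, the following is a minimal pure-Python sketch of agglomerative clustering with unweighted average linkage. It assumes that the cut-off is a similarity threshold below which merging stops; the paper does not spell this out, and the function name and toy similarities are hypothetical.

```python
from itertools import combinations


def average_linkage_clusters(items, sim, cutoff):
    """Merge the two most similar clusters until the best average
    pairwise similarity drops below `cutoff`.

    `sim` maps frozenset({a, b}) to a similarity; missing pairs count as 0.
    """
    clusters = [frozenset([x]) for x in items]

    def avg_sim(c1, c2):
        total = sum(sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        if avg_sim(c1, c2) < cutoff:
            break  # assumption: cut-off is a lower bound on similarity
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical combined similarities between four baits.
bait_sim = {
    frozenset(("B1", "B2")): 0.8,
    frozenset(("B3", "B4")): 0.6,
    frozenset(("B1", "B3")): 0.1,
}
seeds = average_linkage_clusters(["B1", "B2", "B3", "B4"], bait_sim, cutoff=0.3)
```

With a cut-off of 0.3 the four baits end up in two seed clusters, {B1, B2} and {B3, B4}. For real networks a library routine would be preferable to this quadratic-per-merge loop.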
Experimental results and discussion
In order to gauge the effect of incorporating semantic similarity into the clustering process, we first compare the proposed algorithm against BGCA [17]. Since we use BP semantic similarity as the first instance, the proposed algorithm is referred to as BGCA+BP from now on. Then, we evaluate the performance of the proposed algorithm against several existing clustering methods. In [25], BGCA+BP was shown to perform better than BGCA in terms of six quality scores, such as sensitivity, PPV and geometric accuracy. In this paper, we use 9 more quality scores to further evaluate and compare the performance of BGCA+BP and BGCA.
A. The effect of incorporating semantic similarity on detecting protein complexes
Without semantic similarity, the similarities computed for pairs of bait proteins are based solely on their local topological features, that is, the number of shared neighbours. Tables 2 and 3 present the evaluation results for the complexes predicted by BGCA and BGCA+BP from the Gavin_2006 and Krogan_2006 networks.
In [25], BGCA+BP demonstrated higher accuracy and homogeneity values than BGCA, which can also be seen in Tables 2 and 3. In terms of the other 9 quality measures, BGCA+BP also obtained better values. For example, in Table 2, BGCA+BP achieves a 20% increase in BH-F-Measure on the Gavin_2006 network, while the BH-F-Measure value that BGCA+BP obtained is one and a half times that of BGCA. A similar observation can be made in Table 3. The fact that BGCA+BP consistently achieves better scores than BGCA according to the quality measures indicates that combining topological similarity and semantic similarity can enhance the accuracy of predicting protein complexes.
B. Comparison to other clustering methods
Table 4 presents the number and the average size of the clusters generated by all algorithms on the two TAP-MS networks. On both networks, COACH tends to generate clusters of the largest average size, while RRW has clusters of the smallest average size. CODEC yields the largest number of clusters.

Analysis of experimental results on Gavin_2006 network
Table 5 and Table 6 present the quality scores of the clustering results generated by different clustering algorithms from the Gavin_2006 network, compared with the MIPS and CYC2008 gold standards.
According to Table 5, COACH has the highest sensitivity value. The proposed BGCA+BP algorithm achieves the best PPV, as well as the best geometric accuracy. The sensitivity value indicates the average fraction of proteins inside a known complex that are correctly grouped together in the generated clustering result. A large cluster size can artificially increase the sensitivity value, since a large cluster may contain proteins which belong to more than one complex. Conversely, small clusters may inflate PPV. The high sensitivity but low PPV of COACH indicates that its high sensitivity results from the large clusters it generates. Meanwhile, the high PPV but poor sensitivity of RRW demonstrates that very few benchmark complexes are uncovered in the results generated by RRW.
Apart from COACH and RRW, the sensitivity and PPV scores obtained by the rest of the algorithms are quite balanced. BGCA and MCL have higher sensitivity than MCODE and CFinder, since more benchmark complexes are uncovered according to the number of matched complexes. The best geometric accuracy suggests that BGCA+BP achieves a much better performance, as the accuracy reflects the general performance of a clustering algorithm based on the overall correspondence between the set of generated clusters and the set of gold-standard complexes. When compared with the CYC2008 gold standard, a similar observation can be made, as shown in Table 6.
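The interplay between sensitivity, PPV and geometric accuracy can be illustrated with the contingency-table definitions commonly attributed to Brohée and van Helden [11]. The sketch below assumes those definitions are the ones used here; the function names and toy data are hypothetical, and it expects inputs that have already been preprocessed so that every cluster overlaps some complex.

```python
from math import sqrt


def contingency(complexes, clusters):
    """T[i][j] = number of proteins shared by complex i and cluster j."""
    return [[len(co & cl) for cl in clusters] for co in complexes]


def sensitivity_ppv_accuracy(complexes, clusters):
    t = contingency(complexes, clusters)
    # Sensitivity: best coverage of each complex, weighted by complex size.
    sn = sum(max(row) for row in t) / sum(len(co) for co in complexes)
    # PPV: best-matching complex of each cluster, relative to all
    # cluster members that are assigned to any complex.
    columns = list(zip(*t))
    ppv = sum(max(col) for col in columns) / sum(sum(col) for col in columns)
    return sn, ppv, sqrt(sn * ppv)


# Hypothetical toy example.
toy_complexes = [{"A", "B", "C"}, {"D", "E"}]
toy_clusters = [{"A", "B"}, {"C", "D", "E"}]
sn, ppv, acc = sensitivity_ppv_accuracy(toy_complexes, toy_clusters)

# One giant cluster maximises sensitivity while PPV drops.
sn_big, ppv_big, _ = sensitivity_ppv_accuracy(toy_complexes, [{"A", "B", "C", "D", "E"}])
```

The second call shows the effect discussed in the text: a single large cluster reaches sensitivity 1.0 but a lower PPV, which is why the geometric mean of the two is reported.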
Homogeneity is the product of the fraction of members of a cluster found in an annotated complex and the fraction of members of the complex found in the cluster. High homogeneity indicates a bidirectional correspondence between a cluster and a complex. The maximal value of homogeneity is 1, reached when a cluster matches a complex perfectly, that is, when the cluster consists of exactly the members of the complex. As shown in Tables 5 and 6, BGCA+BP achieves the best performance in terms of the geometric homogeneity value, which reflects the general agreement between identified clusters and benchmark complexes, as well as the quality of a clustering result as a whole.
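A minimal sketch of this measure is given below. It assumes that homogeneity corresponds to the "separation" statistic of Brohée and van Helden [11], computed from the same complex-by-cluster contingency table; the normalisation by the number of complexes and clusters, and all names, are assumptions for this example.

```python
from math import sqrt


def homogeneity_scores(complexes, clusters):
    """Return (complex-wise, cluster-wise, geometric) homogeneity."""
    t = [[len(co & cl) for cl in clusters] for co in complexes]
    row_sums = [sum(row) for row in t]
    col_sums = [sum(col) for col in zip(*t)]
    total = 0.0
    for i, row in enumerate(t):
        for j, shared in enumerate(row):
            if shared:
                # Fraction of the complex in the cluster times
                # fraction of the cluster in the complex.
                total += (shared / row_sums[i]) * (shared / col_sums[j])
    complex_wise = total / len(complexes)
    cluster_wise = total / len(clusters)
    return complex_wise, cluster_wise, sqrt(complex_wise * cluster_wise)


# Hypothetical toy example.
hom_co, hom_cl, hom_geo = homogeneity_scores(
    [{"A", "B", "C"}, {"D", "E"}], [{"A", "B"}, {"C", "D", "E"}]
)
perfect = homogeneity_scores([{"A", "B"}], [{"A", "B"}])
```

A perfect one-to-one match yields 1.0 for all three values, in line with the maximum stated above.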
The precision value of a predicted cluster is the fraction of proteins within the cluster which are also found in a benchmark complex. The clustering-wise precision value is the average of the precision values over all clusters. RRW has the highest precision score but, again, a very poor recall value; therefore, the overall PR value for RRW is low, regardless of which gold-standard dataset is used. On the other hand, COACH obtains the highest recall value but very low precision. Overall, BGCA+BP again achieves the best PR value.
The Jaccard index measures the impact of the overlapping section on both a predicted cluster and the corresponding benchmark complex, since it considers the proportion of the overlap in the union of the predicted cluster and the benchmark complex. A high Jaccard index suggests that the set of clustering results is very well matched to the set of benchmark complexes. The second best cluster-wise Jaccard index, the best complex-wise Jaccard index and the best F-Measure obtained by the proposed method suggest that its clustering results are better matched to the benchmark complexes in both gold standards than those of the other algorithms.
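The Jaccard-based scores can be sketched as follows. The aggregation used here (best match per group, averaged with weights proportional to group size, then a harmonic mean) is an assumption modelled on common usage of the measure in [29]; the exact weighting in the paper may differ, and all names are hypothetical.

```python
def jaccard(a, b):
    """Size of the overlap divided by the size of the union."""
    return len(a & b) / len(a | b)


def weighted_best_jaccard(groups, references):
    """Size-weighted average of each group's best Jaccard match."""
    total_size = sum(len(g) for g in groups)
    best = sum(len(g) * max(jaccard(g, r) for r in references) for g in groups)
    return best / total_size


def jaccard_f_measure(clusters, complexes):
    cluster_wise = weighted_best_jaccard(clusters, complexes)
    complex_wise = weighted_best_jaccard(complexes, clusters)
    denom = cluster_wise + complex_wise
    f = 2 * cluster_wise * complex_wise / denom if denom else 0.0
    return cluster_wise, complex_wise, f


# Hypothetical toy example.
jac_cl, jac_co, jac_f = jaccard_f_measure(
    [{"A", "B"}, {"C", "D", "E"}], [{"A", "B", "C"}, {"D", "E"}]
)
```

Unlike sensitivity, the index penalises oversized clusters because the union in the denominator grows with every extra protein.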
BH-Sensitivity measures the percentage of benchmark complexes recovered by generated clusters whose overlap score satisfies a given threshold. BH-Specificity measures the fraction of generated clusters that match benchmark complexes. On the Gavin_2006 network, as shown in Tables 5 and 6, BGCA+BP obtains the highest BH-Specificity against both gold standards, while CODEC-w1 has the best BH-Sensitivity. However, when compared with the MIPS gold standard, BGCA+BP has the best BH-F-Measure, whereas when compared with CYC2008, CODEC-w1 achieves a better BH-F-Measure. This may be due to the incompleteness of each gold standard.
Since BGCA+BP achieves the best value in most quality measures, it can be concluded that it outperforms the other algorithms on the Gavin_2006 network.
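A compact sketch of the BH measures follows. It uses the overlap score of Bader and Hogue [10], the squared overlap divided by the product of the two sizes, and follows the verbal definitions given above. The threshold of 0.2 is only a placeholder assumption, since the value used in the experiments is not stated here, and the function names are hypothetical.

```python
def overlap_score(a, b):
    """Bader-Hogue overlap score: |a & b|^2 / (|a| * |b|)."""
    return len(a & b) ** 2 / (len(a) * len(b))


def bh_scores(clusters, complexes, threshold=0.2):
    """Return (BH-Sensitivity, BH-Specificity, BH-F-Measure)."""
    recovered = sum(
        any(overlap_score(co, cl) >= threshold for cl in clusters)
        for co in complexes
    )
    matching = sum(
        any(overlap_score(cl, co) >= threshold for co in complexes)
        for cl in clusters
    )
    sensitivity = recovered / len(complexes)
    specificity = matching / len(clusters)
    denom = sensitivity + specificity
    f = 2 * sensitivity * specificity / denom if denom else 0.0
    return sensitivity, specificity, f


# Hypothetical toy example: the third cluster matches nothing.
bh_sn, bh_sp, bh_f = bh_scores(
    [{"A", "B"}, {"C", "D", "E"}, {"X", "Y"}],
    [{"A", "B", "C"}, {"D", "E"}],
)
```

In the toy example both complexes are recovered, but one of three clusters has no match, so specificity is 2/3.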

Analysis of experimental results on Krogan_2006 network
Tables 7 and 8 present the quality scores for the clustering results produced from the Krogan_2006 network.
Similar to the results on the Gavin_2006 network, BGCA+BP achieves the best value in most of the overall quality measures, such as geometric accuracy, geometric homogeneity, PR value, and Jaccard F-Measure. Again, as for BH-F-Measure, BGCA+BP has the best value when MIPS is used as the gold standard, whereas CODEC-w1 is the best when compared with the CYC2008 gold standard.
Though the BGCA+BP does not have all the best values, it still achieves most of them, which indicates that BGCA+BP outperforms other clustering algorithms in terms of the overall performance measurement.
C. Statistical significance of clustering results
This section estimates the random expectation of correct grouping by randomising sets of predicted complexes. A set of predicted complexes from the original networks is randomised by shuffling nodes between different complexes while keeping the number of complexes, and the sizes of the corresponding complexes, unchanged. The resulting set of permuted clusters is then evaluated by the quality measures using the gold standards. If the quality scores of the original set of generated clusters are close to those of the random set, the corresponding clustering algorithm yields a set of predicted complexes which is not significantly better than a randomly generated set of complexes.
The process of creating permuted clusters is as follows. The original set of generated clusters was concatenated into a list of proteins. Then the Fisher-Yates shuffle [31, 32] was applied to the list of proteins. The procedure of shuffling was repeated 1,000 times, and then the list was divided into groups in a way that preserves the sizes of the original complexes and the number of complexes. This grouping was then evaluated by each quality measure. Since the Fisher-Yates shuffle chooses any possible permutation of a list with equal probability, the resulting set of permuted clusters can be used to obtain an unbiased estimate of the expected value of any chosen quality score.
The permutation process was repeated 1,000 times, resulting in 1,000 clustering sets. Each clustering set was evaluated by the quality scores, and the average score for each metric was calculated. The p-value is obtained by counting the number of times that a randomised set of clustering results had a higher quality score than the original clustering set, divided by the total number of permutations, which is 1,000 here. A p-value of less than 0.05 indicates that the high performance achieved by the proposed algorithm is unlikely to occur by chance. In this study, we use the Bonferroni correction to counteract the problem of multiple comparisons [33].
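The permutation test can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: Python's `random.shuffle` (a Fisher-Yates implementation) stands in for the shuffle, `score` is any hypothetical function mapping a list of clusters to a quality value, and a single shuffle is applied per permuted set.

```python
import random


def permute_clusters(clusters, rng):
    """Shuffle all cluster members and re-split them into groups
    with the original sizes and the original number of groups."""
    proteins = [p for cluster in clusters for p in cluster]
    rng.shuffle(proteins)
    permuted, start = [], 0
    for cluster in clusters:
        permuted.append(proteins[start:start + len(cluster)])
        start += len(cluster)
    return permuted


def empirical_p_value(clusters, score, n_perm=1000, seed=0):
    """Fraction of permuted sets scoring strictly higher than the original."""
    rng = random.Random(seed)
    observed = score(clusters)
    higher = sum(
        score(permute_clusters(clusters, rng)) > observed for _ in range(n_perm)
    )
    return higher / n_perm


# Hypothetical usage with three small clusters.
original = [["A", "B", "C"], ["D", "E"], ["F"]]
shuffled = permute_clusters(original, random.Random(1))
p_constant = empirical_p_value(original, lambda cl: 1.0, n_perm=10)

# Bonferroni: compare each p-value against alpha / number of measures tested.
bonferroni_threshold = 0.05 / 6
```

Groups are kept as lists so that their sizes are preserved even when the original clusters overlap and a protein appears more than once in the concatenated list.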
Without loss of generality, we only use one gold-standard dataset, CYC2008. Table 9 displays the expected values of BGCA+BP on the Gavin_2006 network and the Krogan_2006 network using the CYC2008 gold standard. The quality scores employed to measure the effect of randomised clusters include the fraction of matched complexes, geometric accuracy, geometric homogeneity, PR value, Jaccard F-Measure and BH-F-Measure.
It can be observed that the average quality scores for Jaccard F-Measure and BH-F-Measure are close to zero. Though the values of geometric accuracy, geometric homogeneity, and PR value are higher, they are still very small compared with those of the original set. Very low p-values indicate that the original set of clusters is significantly better than the randomised clustering sets.
D. Robustness of the proposed algorithm
In order to evaluate the robustness of the proposed algorithm to false positives and false negatives, various levels of alteration have been made by adding or deleting percentages of edges with respect to the number of edges in the original Gavin_2006 network. The graph alteration strategy of [11] is adopted in this study. Increasing fractions of edges (0%, 5%, 10%, 20%, 40%, 80%, 100%) are randomly added to the original graph. Similarly, increasing fractions of edges (0%, 5%, 10%, 20%, 40%, 80%) are randomly deleted from the original network. Specifically, the number of edges added or removed is computed from the number of edges in the original graph. Taking the Gavin_2006 network as an example, 5% of the edges corresponds to 964 edges (5% of 19,277 edges). In the experiment, the Network Analysis Tools (NeAT) [34] were applied to alter the network. Note that self-loops and duplicated edges are not allowed in the altered graphs.
In order to demonstrate the advantage of incorporating semantic similarity, the performance of BGCA in detecting protein complexes from randomly altered graphs is also presented. Geometric accuracy and BH-F-Measure were used to demonstrate the impact on the clustering results of introducing noise into the network. Figure 1 and Figure 2 present the impact on the geometric accuracy and BH-F-Measure of BGCA and BGCA+BP when edges were randomly added.
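The alteration procedure can be approximated with a short script. The sketch below is not NeAT; it is a stand-alone illustration of random edge addition and deletion without self-loops or duplicate edges, and it treats the network as a generic undirected graph. Function names and the toy graph are hypothetical.

```python
import random


def n_altered(n_edges, percent):
    """Number of edges to add or delete for a given percentage."""
    return round(n_edges * percent / 100)


def add_random_edges(nodes, edges, percent, rng):
    """Add random new edges; edges are frozensets of two distinct nodes."""
    nodes, edges = list(nodes), set(edges)
    target = len(edges) + n_altered(len(edges), percent)
    if target > len(nodes) * (len(nodes) - 1) // 2:
        raise ValueError("not enough node pairs for the requested edges")
    while len(edges) < target:
        u, v = rng.sample(nodes, 2)   # two distinct nodes: no self-loops
        edges.add(frozenset((u, v)))  # a set cannot hold duplicate edges
    return edges


def delete_random_edges(edges, percent, rng):
    """Remove a random subset of the existing edges."""
    edge_list = list(edges)
    removed = rng.sample(edge_list, n_altered(len(edge_list), percent))
    return set(edge_list) - set(removed)


# Hypothetical toy graph with five nodes and two edges.
rng = random.Random(0)
toy_nodes = ["A", "B", "C", "D", "E"]
toy_edges = {frozenset(("A", "B")), frozenset(("C", "D"))}
noisy = add_random_edges(toy_nodes, toy_edges, 100, rng)
sparse = delete_random_edges(noisy, 50, rng)
```

`n_altered(19277, 5)` reproduces the 964 edges quoted above for the 5% level.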
As Figure 1 shows, the geometric accuracy curve for BGCA+BP is smooth. The geometric accuracy increases slightly once 5% of edges are added and reaches its highest value when 40% of edges are added. It starts to decline when more than 40% of edges are added; however, even with 100% of edges added, the change relative to the original graph remains trivial. The BHFMeasure curve fluctuates slightly as the fraction of added edges increases from 5% to 20%: the best value is obtained at 5%, after which the BHFMeasure drops and then rises again at 20%. When more than 20% of edges are added, the BHFMeasure declines greatly, but the curve flattens once 80% to 100% of edges are added. For BGCA, by contrast, the geometric accuracy drops drastically as soon as 5% of edges are randomly added to the original graph. When 40% of edges are added, the geometric accuracy of BGCA falls to 0, since none of the generated clusters matches any benchmark complex. Similar observations can be made from Figure 2. With semantic similarity, BGCA+BP is thus much more robust than BGCA when edges are randomly added to the original graph.
Figure 3 and Figure 4 present the impact on geometric accuracy and BHFMeasure of randomly deleting edges from the original graph. The geometric accuracy of BGCA+BP is affected only slightly until more than 40% of edges are deleted from the original graph, and the BHFMeasure likewise drops once 40% of edges are removed. Both measures therefore decline for BGCA+BP only at deletion rates of 40% or more, which shows that BGCA+BP is also robust to edge deletion. For BGCA, the curves for geometric accuracy and BHFMeasure follow a similar trend: after a drop when the fraction of deleted edges increases from 0% to 5%, they remain almost unchanged, demonstrating that BGCA is more robust to edge deletion than to edge addition.
From these observations, it can be concluded that, by incorporating semantic similarity, the proposed algorithm is quite robust to the noise in PPI networks.
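The perturbation experiment described above can be illustrated with a minimal sketch. It assumes that the network is stored as a list of (bait, prey) pairs and that the percentages are taken relative to the number of edges in the original graph; the function names, the seed handling and the toy identifiers are hypothetical and are not taken from the original implementation.

```python
import random


def add_random_edges(edges, baits, preys, fraction, seed=0):
    """Return the edge list plus round(fraction * |E|) random new bait-prey edges."""
    rng = random.Random(seed)
    existing = set(edges)
    # Candidate noise edges: bait-prey pairs absent from the original graph.
    candidates = [(b, p) for b in baits for p in preys if (b, p) not in existing]
    k = min(round(fraction * len(existing)), len(candidates))
    return sorted(existing | set(rng.sample(candidates, k)))


def delete_random_edges(edges, fraction, seed=0):
    """Return the edge list with round(fraction * |E|) randomly chosen edges removed."""
    rng = random.Random(seed)
    unique = sorted(set(edges))
    removed = set(rng.sample(unique, round(fraction * len(unique))))
    return [e for e in unique if e not in removed]


# Toy bait-prey network (hypothetical identifiers).
edges = [("b1", "p1"), ("b1", "p2"), ("b2", "p2"), ("b3", "p3")]
noisy = add_random_edges(edges, ["b1", "b2", "b3"], ["p1", "p2", "p3"], fraction=0.5)
sparse = delete_random_edges(edges, fraction=0.5)
```

Re-running the clustering and the quality measures on `noisy` and `sparse` for a range of fractions would give curves of the kind discussed for Figures 1 to 4.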
Conclusions
In this paper, we propose a new algorithm that combines topological features and semantic similarities between proteins to discover protein complexes in TAP-MS PPI networks. The proposed algorithm extends a previously proposed algorithm, BGCA [17], and has been tested on two published TAP-MS PPI networks, the Gavin_2006 network and the Krogan_2006 network. It inherits the main feature of BGCA, namely that it detects protein complexes by taking co-complex relations in TAP-MS data into account. Results indicate that, by integrating GO-driven similarity knowledge into the clustering process, the proposed algorithm outperforms BGCA as well as several state-of-the-art clustering techniques. Not only is a higher accuracy achieved, but the proposed algorithm also significantly improves the robustness of BGCA to the noise inherent in protein interaction data generated by TAP-MS.
In this paper, topological similarity and semantic similarity are combined in BGCA by taking their average, so that the two similarities receive equal weights. The behaviour of the algorithm under other weighting schemes deserves further investigation. Moreover, incorporating other types of similarity information, such as those derived from the CC and MF ontologies [20], into the algorithm for further improvement will be considered as well.
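One way to explore other weighting schemes is to treat the average as a special case of a convex combination. The sketch below is only an illustration: the function name and the `semantic_weight` parameter are hypothetical, and both similarities are assumed to lie in the interval between 0 and 1.

```python
def combined_similarity(topological, semantic, semantic_weight=0.5):
    """Convex combination of two similarity scores.

    semantic_weight = 0.5 reproduces the plain average used in this paper;
    other values are hypothetical alternatives, not results from the study.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must lie between 0 and 1")
    return semantic_weight * semantic + (1.0 - semantic_weight) * topological


equal = combined_similarity(0.2, 0.8)                              # plain average
semantic_only = combined_similarity(0.2, 0.8, semantic_weight=1.0)
```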
Quality measures
This section introduces the quality measures used in the study. These measures quantify the degree of agreement between the predicted clusters produced by a clustering algorithm and the well-studied clusters in a reference set. When the task is to identify complexes in PPI networks, the reference set can be built from gold-standard databases such as CYC2008 [28] and MIPS [27]. Generally, the values of these quality measures fall in the interval between 0 and 1; the higher the value, the better the quality of the clustering and the better the performance of the clustering algorithm.
Let $C$ be the set of predicted clusters and $M$ be the set of benchmark protein complexes. Let $n$ be the number of clusters in $C$ and $m$ be the number of complexes in $M$. An $n\times m$ confusion matrix $Z$ is then constructed to compare predicted clusters with gold-standard complexes. The ${i}^{th}$ row corresponds to candidate cluster $i$, while the ${j}^{th}$ column corresponds to benchmark complex $j$. The entry ${z}_{ij}$ is the size of the intersection between the ${i}^{th}$ row and the ${j}^{th}$ column, that is, the number of proteins that are members of cluster $i$ and also belong to complex $j$. ${z}_{i}$ is the size of the ${i}^{th}$ cluster, while ${z}_{j}$ is the size of the ${j}^{th}$ complex.
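The construction of the confusion matrix can be sketched as follows, assuming that clusters and complexes are given as sets of protein identifiers. The function name and the toy data are hypothetical.

```python
def confusion_matrix(clusters, complexes):
    """Entry [i][j] is the number of proteins shared by cluster i and complex j."""
    return [[len(set(c) & set(m)) for m in complexes] for c in clusters]


# Toy example with hypothetical protein identifiers.
clusters = [{"A", "B", "C"}, {"D", "E"}]
complexes = [{"A", "B"}, {"C", "D", "E"}, {"F"}]

Z = confusion_matrix(clusters, complexes)
cluster_sizes = [len(c) for c in clusters]    # z_i for each cluster
complex_sizes = [len(m) for m in complexes]   # z_j for each complex
```

For this toy input, the first cluster shares two proteins with the first complex and one with the second, while the third complex is not recovered by any cluster.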

Sensitivity, Positive Predictive Value (PPV), and Geometric Accuracy
Geometric accuracy, proposed by Brohée and van Helden [11], measures the degree of correspondence between the set of predicted clusters and the set of benchmark complexes. It combines two component measures, sensitivity and PPV.
Sensitivity is defined as the proportion of proteins of benchmark complex $j$ that are found in predicted cluster $i$, that is, ${z}_{ij}/{z}_{j}$. The general sensitivity is obtained as the weighted average, over all complexes, of the maximal sensitivity of each complex.
PPV is the maximal fraction of a predicted cluster $i$ that belongs to the same benchmark complex, where the fraction for complex $j$ is ${z}_{ij}$ divided by $\sum _{j=1}^{m}{z}_{ij}$, the marginal sum of predicted cluster $i$. It indicates the reliability with which predicted cluster $i$ predicts that a protein belongs to its best-matching benchmark complex.
Geometric accuracy is defined as the geometric mean of general sensitivity and PPV, that is, the square root of their product.
Accuracy reflects the tradeoff between sensitivity and PPV: a high accuracy value requires high performance on both measures. The higher the accuracy value, the better the quality of the clustering result.
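A minimal sketch of these three measures is given below. The weights are an assumption taken from the usual formulation of Brohée and van Helden [11]: the maximal sensitivity of each complex is weighted by the complex size, and the maximal PPV of each cluster by its marginal sum. Function names and the toy matrix are hypothetical.

```python
import math


def general_sensitivity(Z, complex_sizes):
    """Weighted average over complexes of max_i z_ij / z_j (weights: complex sizes)."""
    weighted = 0.0
    for j, size in enumerate(complex_sizes):
        best = max(row[j] / size for row in Z)
        weighted += size * best
    return weighted / sum(complex_sizes)


def general_ppv(Z):
    """Weighted average over clusters of max_j z_ij / row sum (weights: row sums)."""
    weighted, total = 0.0, 0
    for row in Z:
        row_sum = sum(row)
        if row_sum == 0:          # cluster overlaps no benchmark complex
            continue
        weighted += row_sum * (max(row) / row_sum)
        total += row_sum
    return weighted / total if total else 0.0


def geometric_accuracy(Z, complex_sizes):
    """Geometric mean of general sensitivity and general PPV."""
    return math.sqrt(general_sensitivity(Z, complex_sizes) * general_ppv(Z))


# Toy confusion matrix: 2 clusters (rows) against 3 complexes (columns).
Z = [[2, 1, 0], [0, 2, 0]]
complex_sizes = [2, 3, 1]
sn = general_sensitivity(Z, complex_sizes)
ppv = general_ppv(Z)
acc = geometric_accuracy(Z, complex_sizes)
```

In the toy example the unmatched third complex lowers the sensitivity to 4/6, while the PPV stays at 0.8 because each cluster is dominated by a single complex.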

Homogeneity
Homogeneity [35], called separation by Brohée and van Helden [11], measures the degree of bidirectional correspondence between a predicted cluster and a benchmark complex. For cluster $i$ and complex $j$, it is the product of the fraction of the cluster's proteins found in the complex and the fraction of the complex's proteins found in the cluster, each taken relative to the marginal sum of the corresponding row or column.
The clusterwise homogeneity $h{M}_{c{l}_{i}}$ describes how the proteins detected as members of the same cluster $i$ are distributed over the annotated complexes. It is calculated as the sum of the homogeneity values of cluster $i$ over all benchmark complexes.
Similarly, the complexwise homogeneity $h{M}_{c{o}_{j}}$ describes how the proteins of the same benchmark complex $j$ are distributed over all the predicted clusters. It is calculated as the sum of the homogeneity values of benchmark complex $j$ over all predicted clusters.
To measure the general clusterwise homogeneity $h{M}_{cl}$ and the general complexwise homogeneity $h{M}_{co}$, the values of $h{M}_{c{l}_{i}}$ and $h{M}_{c{o}_{j}}$ are averaged over all predicted clusters and all benchmark complexes, respectively.
To estimate the homogeneity of a clustering as a whole, the general homogeneity $hM$ is defined as the geometric mean of the general clusterwise and complexwise homogeneity, $hM=\sqrt{h{M}_{cl}\times h{M}_{co}}$.
Homogeneity reflects the relative distribution of the overlaps between annotated complexes and generated clusters. When proteins are allowed to be assigned to multiple clusters, the clusterwise homogeneity will be lower, and thus the general homogeneity will be lower as well.
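The homogeneity computation can be sketched as follows, assuming that both fractions are taken relative to the row and column sums of the confusion matrix, as in the separation measure of [11]. The function name and the toy matrix are hypothetical.

```python
import math


def homogeneity(Z):
    """Return (clusterwise, complexwise, general) homogeneity of confusion matrix Z."""
    n, m = len(Z), len(Z[0])
    row_sums = [sum(row) for row in Z]
    col_sums = [sum(Z[i][j] for i in range(n)) for j in range(m)]
    # Pairwise homogeneity: (z_ij / row sum) * (z_ij / column sum); 0 when no overlap.
    h = [[(Z[i][j] ** 2) / (row_sums[i] * col_sums[j]) if Z[i][j] else 0.0
          for j in range(m)] for i in range(n)]
    per_cluster = [sum(h[i]) for i in range(n)]                        # hM_cl_i
    per_complex = [sum(h[i][j] for i in range(n)) for j in range(m)]   # hM_co_j
    hm_cl = sum(per_cluster) / n
    hm_co = sum(per_complex) / m
    return hm_cl, hm_co, math.sqrt(hm_cl * hm_co)


Z = [[2, 1, 0], [0, 2, 0]]
hm_cl, hm_co, hm = homogeneity(Z)
```

Because the same pairwise values are summed in both directions, the clusterwise and complexwise values differ only through the number of clusters and the number of complexes used in the averages.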

Precision, Recall and PRvalue
In a clustering task, precision is defined as the fraction of the items in a predicted class that are True Positives (TPs), i.e. correctly labelled items, and recall is the fraction of the items in a reference class that are TPs [29]. In the context of detecting protein complexes in PPI networks, the precision of cluster $i$ is the number of TPs divided by the size of the cluster, while the recall of complex $j$ is the number of TPs divided by the size of the benchmark complex [29]. Here, TPs are proteins found in the predicted cluster and also annotated in the benchmark complex [29]. The number of TPs between cluster $i$ and complex $j$ is equal to the size of their intersection in the confusion matrix defined above. Thus, the precision $P{r}_{ij}$ and recall $R{e}_{ij}$ of cluster $i$ and complex $j$ are computed as $P{r}_{ij}={z}_{ij}/{z}_{i}$ and $R{e}_{ij}={z}_{ij}/{z}_{j}$,
where ${z}_{i}$ and ${z}_{j}$ represent the size of predicted cluster $i$ and the size of benchmark complex $j$, respectively. The maximal precision value for cluster $i$ over all benchmark complexes is used as the precision of predicted cluster $i$.
The recall of benchmark complex $j$ is defined analogously, taking the maximal recall value over all predicted clusters.
Recall reveals how well a benchmark complex is covered by its corresponding cluster. Precision, in contrast, is normalised by the size of the local cluster and so measures the percentage of TPs in that cluster.
A general precision is obtained by calculating the weighted average of precision over all predicted clusters.
The general recall likewise uses the weighted average of recall values over all benchmark complexes.
The PR value is the harmonic mean of general precision and general recall; it reflects the proportion of TPs predicted in a clustering as well as the general correspondence between predicted clusters and benchmark complexes.
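The following sketch computes general precision, general recall and the PR value from a confusion matrix. The weights of the two averages are not stated in the text; cluster sizes and complex sizes are assumed here, following the formulation in [29]. The function name and the toy data are hypothetical.

```python
def pr_value(Z, cluster_sizes, complex_sizes):
    """Return (general precision, general recall, PR value)."""
    # Precision of cluster i: best z_ij / z_i over all complexes.
    precision_i = [max(row) / size for row, size in zip(Z, cluster_sizes)]
    # Recall of complex j: best z_ij / z_j over all clusters.
    recall_j = [max(row[j] for row in Z) / size
                for j, size in enumerate(complex_sizes)]
    # Size-weighted averages (assumed weighting scheme).
    precision = (sum(s * p for s, p in zip(cluster_sizes, precision_i))
                 / sum(cluster_sizes))
    recall = (sum(s * r for s, r in zip(complex_sizes, recall_j))
              / sum(complex_sizes))
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


Z = [[2, 1, 0], [0, 2, 0]]
precision, recall, pr = pr_value(Z, cluster_sizes=[3, 2], complex_sizes=[2, 3, 1])
```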

BHSensitivity and BHSpecificity
Bader and Hogue [10] used a definition of sensitivity different from the one proposed by Brohée and van Helden [11]. To distinguish the two, the sensitivity and specificity introduced in this section are referred to as BHSensitivity and BHSpecificity, where BH stands for the initials of the authors, Bader and Hogue [10]. In the set of predicted clusters, the numbers of TPs, True Negatives (TNs), False Positives (FPs) and False Negatives (FNs) depend on how the threshold is selected relative to the set of gold-standard complexes. An overlap score $w$ was proposed by Bader and Hogue in 2003 [10] to measure how significantly a predicted cluster matches a benchmark complex, $w={z}_{ij}^{2}/({z}_{i}\times {z}_{j})$,
where ${z}_{ij}$ is the number of proteins shared by predicted cluster $i$ and benchmark complex $j$, ${z}_{i}$ is the size of predicted cluster $i$ and ${z}_{j}$ is the size of benchmark complex $j$.
The number of TPs is defined as the number of predicted clusters with $w$ over a threshold value, and the number of FPs is the total number of predicted clusters minus the number of TPs. The number of FNs is defined as the number of benchmark complexes that are not matched by any predicted cluster, while the number of TNs is the number of benchmark complexes that are matched by predicted clusters with $w$ over the threshold value. Sensitivity and specificity are then calculated from these counts, following the definitions of Bader and Hogue [10].
In this study, the threshold value of $w$ is set to 0.2. The f-measure of BHSensitivity and BHSpecificity is also employed to measure the overall performance of a clustering algorithm.
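The sketch below illustrates one reading of these definitions. It assumes the overlap score of Bader and Hogue [10], the squared overlap divided by the product of the two sizes, together with sensitivity TP/(TP+FN) and specificity TP/(TP+FP) as used in [10]. Treating "over a threshold" as a strict inequality is also an assumption, and the function name and toy data are hypothetical.

```python
def bh_scores(Z, cluster_sizes, complex_sizes, threshold=0.2):
    """Return (BH sensitivity, BH specificity, f-measure) at the given threshold."""
    n, m = len(cluster_sizes), len(complex_sizes)

    def w(i, j):
        # Overlap score between cluster i and complex j.
        return Z[i][j] ** 2 / (cluster_sizes[i] * complex_sizes[j])

    # TP: clusters matching at least one complex above the threshold.
    tp = sum(1 for i in range(n) if any(w(i, j) > threshold for j in range(m)))
    fp = n - tp
    # FN: complexes not matched by any cluster above the threshold.
    fn = sum(1 for j in range(m) if not any(w(i, j) > threshold for i in range(n)))

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + specificity
    f_measure = 2 * sensitivity * specificity / total if total else 0.0
    return sensitivity, specificity, f_measure


Z = [[2, 1, 0], [0, 2, 0]]
sens, spec, f = bh_scores(Z, cluster_sizes=[3, 2], complex_sizes=[2, 3, 1])
```

In the toy example both clusters match a complex, but the third complex is missed, so the specificity is 1 while the sensitivity is 2/3.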

Jaccard index
Extended from the Jaccard similarity measure [26], the Jaccard index calculates the size of the intersection between a predicted cluster and a benchmark complex as a fraction of the size of their union [29].
In order to measure how well the group of predicted clusters maps to the benchmark complexes, for each cluster $i$, the benchmark complex $j$ that maximises the overlap between itself and cluster $i$ is found; that is, the Jaccard index of cluster $i$ is the maximum of ${z}_{ij}/\left|{z}_{i}\cup {z}_{j}\right|$ over all benchmark complexes $j$,
where $\left|{z}_{i}\cup {z}_{j}\right|$ represents the size of the union of predicted cluster $i$ and benchmark complex $j$. Then, the clusterwise Jaccard index $Ja{c}_{cl}$ is calculated as a weighted average of these values over all predicted clusters.
Similarly, to measure how well the set of benchmark complexes corresponds to the set of predicted clusters, a complexwise Jaccard index is calculated. First, for each benchmark complex $j$, the maximum Jaccard index over all predicted clusters is obtained.
Then, the complexwise Jaccard index $Ja{c}_{co}$ over the set of benchmark complexes is calculated in the same way.
Finally, the general Jaccard index is defined as the harmonic mean of $Ja{c}_{cl}$ and $Ja{c}_{co}$, that is, $2\times Ja{c}_{cl}\times Ja{c}_{co}/(Ja{c}_{cl}+Ja{c}_{co})$.
The Jaccard measure reflects the degree of bidirectional correspondence between the set of predicted clusters and the group of benchmark complexes. A higher Jaccard value indicates that the predicted clusters match the benchmark complexes well, and vice versa.
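A minimal sketch of the Jaccard computation is shown below, working directly on sets of protein identifiers. The weights of the two averages are assumed to be the cluster sizes and the complex sizes, which the text does not state explicitly. The function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def jaccard_index(clusters, complexes):
    """Return (clusterwise, complexwise, general) Jaccard index."""
    best_per_cluster = [max(jaccard(c, m) for m in complexes) for c in clusters]
    best_per_complex = [max(jaccard(c, m) for c in clusters) for m in complexes]
    # Size-weighted averages (assumed weighting scheme).
    jac_cl = (sum(len(c) * v for c, v in zip(clusters, best_per_cluster))
              / sum(len(c) for c in clusters))
    jac_co = (sum(len(m) * v for m, v in zip(complexes, best_per_complex))
              / sum(len(m) for m in complexes))
    total = jac_cl + jac_co
    general = 2 * jac_cl * jac_co / total if total else 0.0
    return jac_cl, jac_co, general


clusters = [{"A", "B", "C"}, {"D", "E"}]
complexes = [{"A", "B"}, {"C", "D", "E"}, {"F"}]
jac_cl, jac_co, jac = jaccard_index(clusters, complexes)
```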
References
 1.
Alberts B: "The cell as a collection of protein machines: Preparing the next generation of molecular biologists,". Cell 1998, 92: 291–294. 10.1016/S0092-8674(00)80922-8
 2.
Hartwell L, Hopfield J, Leibler S, Murray A: "From molecular to modular cell biology,". Nature 1999, 402: C47–C52. 10.1038/35011540
 3.
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: "A comprehensive two-hybrid analysis to explore the yeast protein interactome,". Proc Natl Acad Sci USA 2001, 98: 4569–4574. 10.1073/pnas.061034498
 4.
Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, et al.: "A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae,". Nature 2000, 403: 623–627. 10.1038/35001009
 5.
Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, et al.: "Functional organization of the yeast proteome by systematic analysis of protein complexes,". Nature 2002, 415: 141–147. 10.1038/415141a
 6.
Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, et al.: "Proteome survey reveals modularity of the yeast cell machinery,". Nature 2006, 440: 631–636. 10.1038/nature04532
 7.
Yu J, Fotouhi F: "Computational approaches for predicting protein-protein interactions: a survey,". J Med Sys 2006, 30: 39–44. 10.1007/s10916-006-7402-3
 8.
Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, et al.: "Global landscape of protein complexes in the yeast Saccharomyces cerevisiae,". Nature 2006, 440: 637–643. 10.1038/nature04670
 9.
van Dongen S: Graph Clustering by Flow Simulation [Ph.D Dissertation]. Centers for Mathematics and Computer Science, University of Utrecht; 2000.
 10.
Bader GD, Hogue CW: "An automated method for finding molecular complexes in large protein interaction networks,". BMC Bioinformatics 2003, 4: 2. 10.1186/1471-2105-4-2
 11.
Brohée S, van Helden J: "Evaluation of clustering algorithms for protein-protein interaction networks,". BMC Bioinformatics 2006, 7: 488. 10.1186/1471-2105-7-488
 12.
Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: "CFinder: locating cliques and overlapping modules in biological networks,". Bioinformatics 2006, 22: 1021–1023. 10.1093/bioinformatics/btl039
 13.
Macropol K, Can T, Singh AK: "RRW: repeated random walks on genome-scale protein networks for local cluster discovery,". BMC Bioinformatics 2009, 10: 283. 10.1186/1471-2105-10-283
 14.
Wu M, Li X, Kwoh CK, Ng SK: "A core-attachment based method to detect protein complexes in PPI networks,". BMC Bioinformatics 2009, 10: 169. 10.1186/1471-2105-10-169
 15.
Scholtens D, Vidal M, Gentleman R: "Local modeling of global interactome networks,". Bioinformatics 2005, 21: 3548–3557. 10.1093/bioinformatics/bti567
 16.
Geva G, Sharan R: "Identification of protein complexes from co-immunoprecipitation data,". Bioinformatics 2011, 27: 111–117. 10.1093/bioinformatics/btq652
 17.
Cai B, Wang HY, Zheng H, Wang H: "Detection of protein complexes from Affinity Purification/Mass Spectrometry data,". BMC Systems Biology 2012, 6: s4.
 18.
Azuaje F, Wang HY, Zheng H, Bodenreider O, Chesneau A: "Predictive integration of gene ontology-driven similarity and functional interactions,". Proceeding of the 6th IEEE International Conference on Data Mining 2006, 114–119.
 19.
Pesquita C, Faria D, Falcão AO, Lord P, Couto FM: "Semantic similarity in biomedical ontologies,". PLoS Comput Biol 2009, 5: e1000443. 10.1371/journal.pcbi.1000443
 20.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al.: "Gene Ontology: tool for the unification of biology,". Nat Genet 2000, 25: 25–29. 10.1038/75556
 21.
Jiang J, Conrath DW: "Semantic similarity based on corpus statistics and lexical taxonomy,". In Proceedings of International Conference Research on Computational Linguistics. Taiwan; 1997:19–33.
 22.
Resnik P: "Using information content to evaluate semantic similarity in a taxonomy,". In Proceedings of the 14th International Joint Conference on Artificial Intelligence. San Francisco, CA, USA; 1995:448–453.
 23.
Lin D: "An information-theoretic definition of similarity,". In Proceedings of 15th International Conference on Machine Learning. Madison, Wisconsin, USA; 1998:296–304.
 24.
Azuaje F, Bodenreider O: "Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study,". Proceeding of the IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE2004) 2004, 317–324.
 25.
Cai B, Wang HY, Zheng H, Wang H: "Incorporating semantic similarity into clustering process for identifying protein complexes from affinity purification/mass spectrometry data,". In Proceeding of IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Philadelphia, PA, USA; 2012:1–4.
 26.
Jaccard P: "Étude comparative de la distribution florale dans une portion des Alpes et des Jura,". Bulletin De La Société Vaudoise Des Sciences Naturelles 1901, 37: 547–579.
 27.
Mewes HW, Frishman D, Mayer KFX, Muensterkoetter M, Noubibou O, Pagel P, et al.: "MIPS: analysis and annotation of proteins from whole genomes in 2005,". Nucleic Acids Res 2006, 34: D169–D172. 10.1093/nar/gkj148
 28.
Pu S, Wong J, Turner B, Cho E, Wodak SJ: "Up-to-date catalogues of yeast protein complexes,". Nucleic Acids Res 2009, 37: 825–831. 10.1093/nar/gkn1005
 29.
Song J, Singh M: "How and when should interactome-derived clusters be used to predict functional modules and protein function?". Bioinformatics 2009, 25: 3143–3150. 10.1093/bioinformatics/btp551
 30.
Enright AJ, van Dongen S, Ouzounis CA: "An efficient algorithm for large-scale detection of protein families,". Nucleic Acids Res 2002, 30: 1575. 10.1093/nar/30.7.1575
 31.
Fisher RA, Yates F: Statistical Tables for Biological, Agricultural and Medical Research 6th Edition. Edinburgh: Oliver & Boyd; 1948.
 32.
Durstenfeld R: "Algorithm 235: Random permutation,". Communications of the ACM 1964, 7: 420.
 33.
Bland JM, Altman DG: "Multiple significance tests: The Bonferroni method,". BMJ 1995, 310: 170. 10.1136/bmj.310.6973.170
 34.
Brohée S, Faust K, Lima-Mendez G, Sand O, Janky R, Vanderstocken G, et al.: "NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways,". Nucleic Acids Research 2008, 36: W444–W451. 10.1093/nar/gkn336
 35.
Zheng H, Wang HY, Glass DH: "Integration of genomic data for inferring protein complexes from global protein-protein interaction networks,". IEEE Transaction on Systems, Man, and Cybernetics-Part B: Cybernetics 2008, 38: 5–18.
Acknowledgements
Miss Bingjing Cai is supported by the Vice Chancellor's Research Scholarships, University of Ulster, UK.
Declaration
The publication costs for the article will be funded by the Computer Science Research Institute, University of Ulster.
This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BC contributed to algorithm design and carried out all programming and analyses as a Ph.D student at the University of Ulster. HYW, HZ and HW supervised this study, guided algorithm development and data analysis, and contributed to the preparation of this manuscript. All authors read and approved the final manuscript.
Cite this article
Cai, B., Wang, H., Zheng, H. et al. Integrating domain similarity to improve protein complexes identification in TAP-MS data. Proteome Sci 11, S2 (2013). https://doi.org/10.1186/1477-5956-11-S1-S2
Keywords
 Positive Predictive Value
 Semantic Similarity
 Jaccard Index
 Geometric Accuracy
 Bait Protein