Filtering Gene Ontology semantic similarity for identifying protein complexes in large protein interaction networks
© Wang et al; licensee BioMed Central Ltd. 2012
- Published: 21 June 2012
Protein complexes play a central role in many biological processes, and various computational approaches have been developed to identify complexes from protein-protein interaction (PPI) networks. However, the high false-positive rate of PPI data makes this identification challenging.
In this study, we propose a protein semantic similarity measure, based on the ontology structure of Gene Ontology (GO) terms and GO annotations, to estimate the reliability of interactions in PPI networks. Interaction pairs with low GO semantic similarity are removed from the network as unreliable. A cluster-expanding algorithm is then used to detect complexes with a core-attachment structure on the filtered network. The method is applied to three different yeast PPI networks, and its effectiveness is examined on two benchmark complex datasets. Experimental results show that our method performs better than other state-of-the-art approaches on most evaluation metrics.
The method detects protein complexes from large-scale PPI networks by filtering interactions according to GO semantic similarity. Removing interactions with low GO similarity significantly improves the performance of complex identification. The expanding strategy is also effective for identifying the attachment proteins of complexes.
- Gene Ontology
- Semantic Similarity
- Attachment Protein
- Ontology Structure
- Semantic Similarity Measure
Protein complexes are important molecular entities in cellular organization. With the large amounts of protein interaction data produced by high-throughput experimental techniques [1, 2], protein complexes can be identified automatically from genome-scale interaction networks by computational approaches. Generally, proteins in a complex share more interactions among themselves than with other proteins . Many algorithms based on graph theory have been proposed to identify protein complexes by detecting dense regions in PPI networks, such as MCODE , MCL , and CFinder . However, their performance is affected by false-positive interactions in the network. In some experiments, the proportion of false-positive interactions generated by high-throughput techniques is estimated to be up to 50% . It is therefore reasonable to use biological information to measure the reliability of interaction pairs or predicted complexes. For example, protein function annotation datasets are used in RNSC  and DECAFF  to filter out complexes with low functional homogeneity or reliability.
GO annotation is a useful information resource to measure the reliability of protein interaction pairs. The GO project maintains three structured controlled vocabularies, which describe gene products in terms of their associated biological processes, cellular components, and molecular functions . The ontology of each domain is structured as a directed acyclic graph (DAG), which organizes terms by their relationships. The similarity of two gene products based on GO annotations can be considered as the similarity of two sets of GO terms. The semantic similarity of GO terms can be measured by the topological information in the ontology structure.
In this paper, we use GO annotations and the ontology structure of GO terms to measure the semantic similarity of GO terms and proteins. The similarity of two GO terms is measured based on their average distance to their lowest common ancestors in the ontology structure. Semantic similarity between proteins is computed as the similarity of the two sets of GO terms that annotate the two proteins respectively. PPIs in the network are then weighted by the similarity of the interacting proteins for the filtering and clustering steps. As far as we know, most approaches filter out predicted complexes with low density or statistical significance in post-processing [4, 9, 11, 12], which can still leave some unreliable interactions in the results. In our method, however, the low-weight interactions are filtered first, followed by a cluster-expanding algorithm that identifies high-quality complexes consisting of only reliable interactions. Considering the core-attachment structure revealed by Gavin et al. , which reflects the inherent organization of protein complexes, we propose a network clustering algorithm that identifies the core and attachment proteins of complexes successively. Firstly, cliques in the filtered network are detected, and highly overlapping cliques are merged to form the cores of complexes. Secondly, we add attachment proteins to the cores, making use of the cluster-expanding strategy of the RRW algorithm , which is appropriate for expanding clusters consisting of multiple nodes in weighted networks. By applying the clustering algorithm to the purified PPI network, our method identifies complexes with high biological significance and functional homogeneity.
Our method consists of four steps:
- Compute the protein semantic similarity for every pair of interacting proteins in the PPI network.
- Remove interactions with low similarity from the original network.
- Find cliques in the filtered network to form complex cores. Multiple highly overlapping cliques are merged to form one core.
- Add attachment proteins to these cores with the expanding strategy of the RRW algorithm.
Semantic similarity for PPI
The GO database is currently one of the most comprehensive and well-curated ontology databases in the bioinformatics community. The ontology structure of GO terms is organized as DAGs of three domains with terms as nodes and their relationships as directed edges. The GO terms are structured by two kinds of relationships to each other: "is-a" and "part-of", representing specific-to-general and part-to-whole relations respectively.
where root denotes a virtual node that is the parent of the three root nodes of the three distinct DAGs (biological process, cellular component and molecular function) in GO. $d_a(\mathrm{root}, x)$ denotes the average length of paths from root to x passing through a, $d_a(\mathrm{root}, x) = d(\mathrm{root}, a) + d(a, x)$. Sim(x, y) reaches its minimum value 0 when x and y are terms in different domains, and its maximum value 1 when x and y are the same term.
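The equation that the paragraph above annotates is not present in this copy, so the following Python sketch is only an illustration under stated assumptions and should not be read as the authors' formula. It assumes a Wu-Palmer-style ratio, Sim(x, y) = 2·d(root, a) / (d_a(root, x) + d_a(root, y)), maximized over the common ancestors a, which is at least consistent with the two boundary properties stated above (0 across domains, 1 for identical terms). The toy ontology, the term names and all function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy ontology: child -> parents ("is-a" / "part-of" edges).
# "root" is the virtual node above the per-domain roots "BP" and "MF".
PARENTS = {
    "BP": ["root"], "MF": ["root"],
    "metabolism": ["BP"],
    "transport": ["BP"],
    "ion_transport": ["transport"],
    "proton_transport": ["ion_transport", "transport"],
    "binding": ["MF"],
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

@lru_cache(maxsize=None)
def path_lengths(anc, term):
    """Lengths of all upward paths from term to anc, as a tuple."""
    if term == anc:
        return (0,)
    lengths = []
    for parent in PARENTS.get(term, []):
        lengths.extend(n + 1 for n in path_lengths(anc, parent))
    return tuple(lengths)

def avg_dist(anc, term):
    """Average path length d(anc, term); anc must be an ancestor of term."""
    lengths = path_lengths(anc, term)
    return sum(lengths) / len(lengths)

def term_sim(x, y):
    """Assumed Wu-Palmer-style similarity, maximized over common ancestors."""
    best = 0.0
    for a in ancestors(x) & ancestors(y):
        d_root_a = avg_dist("root", a)
        # d_a(root, x) = d(root, a) + d(a, x), as defined in the text.
        denom = (d_root_a + avg_dist(a, x)) + (d_root_a + avg_dist(a, y))
        if denom > 0:
            best = max(best, 2 * d_root_a / denom)
    return best

print(term_sim("metabolism", "binding"))   # terms from different domains
print(term_sim("transport", "transport"))  # identical terms
```

Terms from different domains share only the virtual root, for which d(root, root) = 0, so the ratio is 0; for identical terms the best common ancestor is the term itself and the ratio is 1.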
$T_A$ and $T_B$ denote the sets of terms annotating proteins A and B respectively. For every term x in $T_A$, we find the most similar term in $T_B$ to calculate , and vice versa. The similarity of proteins A and B is then the average of these term-pair similarity values, which, like Sim, lies between 0 and 1.
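The best-match averaging described above can be sketched in Python as follows. This is one reading of the prose, since the equation itself is missing from this copy: every term contributes its best match in the other set, and the sum over both directions is divided by the total number of terms. The function name `psim`, the toy similarity table and the choice to return 0 for an unannotated protein are assumptions made for illustration.

```python
def psim(terms_a, terms_b, term_sim):
    """Best-match average of term similarities between two annotation sets."""
    if not terms_a or not terms_b:
        return 0.0  # assumption: an unannotated protein gets similarity 0
    best_a = [max(term_sim(x, y) for y in terms_b) for x in terms_a]
    best_b = [max(term_sim(x, y) for x in terms_a) for y in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(terms_a) + len(terms_b))

# Hypothetical term similarities; identical terms have similarity 1.
TABLE = {frozenset(("t1", "t2")): 0.5}

def toy_sim(x, y):
    return 1.0 if x == y else TABLE.get(frozenset((x, y)), 0.0)

print(psim(["t1"], ["t1", "t2"], toy_sim))
```

In the usage line, t1 matches itself with similarity 1 in both directions and t2 matches t1 with 0.5, so the result is (1 + 1 + 0.5) / 3.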
We use the PSim similarity to weight every pairwise interaction in the PPI network. Because the interaction network contains unreliable interactions, we remove every interaction whose PSim value is no larger than a threshold filter_thres. Only high-quality interactions are involved in the following complex identification steps.
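A minimal Python sketch of this filtering step is given below. The dictionary layout, the function name `filter_interactions` and the protein identifiers are hypothetical; only the rule itself (drop an interaction when its PSim value is no larger than filter_thres) comes from the text.

```python
def filter_interactions(weighted_ppi, filter_thres):
    """Drop interactions whose PSim weight is no larger than filter_thres."""
    return {pair: w for pair, w in weighted_ppi.items() if w > filter_thres}

# Hypothetical PSim-weighted interactions between placeholder proteins.
weighted_ppi = {
    ("P1", "P2"): 0.82,
    ("P1", "P3"): 0.60,  # equal to the threshold, so it is removed
    ("P2", "P4"): 0.15,
}
filtered = filter_interactions(weighted_ppi, filter_thres=0.6)
print(filtered)
```

Note that the comparison is strict: an interaction whose weight equals the threshold is removed, matching the "no larger than" wording.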
The core-attachment structure  provides insight into the inherent organization of protein complexes. Several methods such as COACH  and CORE  have made good use of this characteristic to detect protein complexes from PPI networks. The core proteins of a complex have relatively more interactions among themselves and share a high degree of functional similarity. Attachment proteins are the proteins surrounding the core that perform related functions.
where $V_A$ and $V_B$ denote the node sets of cliques A and B respectively. The merged clusters, together with the cliques not involved in any merge, form the set of complex cores. Attachment proteins are added to each core by the expanding strategy of the RRW algorithm . RRW is well suited to cluster expansion because it simulates a random walk with a restart probability starting from multiple nodes in a network. After computing the stationary vector of every single node in the network, the original RRW algorithm expands clusters starting from every node, adding one node to the cluster and saving the expanded cluster at each expanding step. The clusters are then sorted and filtered by their statistical significance. Since this filtering strategy tends to generate relatively small clusters, we instead use the expanding strategy to run repeated random walks from every core protein set with its neighbor nodes, and add only the maximal expansion of each cluster to the result set. The original minimum and maximum cluster sizes in RRW are 5 and 11, whereas the size distributions of hand-curated complexes from CYC2008 , Aloy  and MIPS  indicate that most complexes have between 2 and 20 members. We therefore set these parameters to 2 and 20 respectively, and leave the other parameters at their defaults.
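The core-forming step (merging highly overlapping cliques) might be sketched in Python as below. The overlap equation and the merging threshold are not present in this copy, so the sketch assumes the overlap score |V_A ∩ V_B|² / (|V_A| · |V_B|) and takes a hypothetical parameter `merge_thres`; the greedy merge order is likewise an assumption rather than the authors' procedure.

```python
def overlap(a, b):
    """Assumed overlap score between two node sets."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))

def merge_cliques(cliques, merge_thres):
    """Greedily merge cliques until no pair overlaps by merge_thres or more."""
    cores = [set(c) for c in cliques]
    merged = True
    while merged:
        merged = False
        for i in range(len(cores)):
            for j in range(i + 1, len(cores)):
                if overlap(cores[i], cores[j]) >= merge_thres:
                    cores[i] |= cores.pop(j)  # fold clique j into clique i
                    merged = True
                    break
            if merged:
                break
    return cores

# Hypothetical cliques over placeholder proteins.
cliques = [{"A", "B", "C"}, {"B", "C", "D"}, {"X", "Y"}]
cores = merge_cliques(cliques, merge_thres=0.4)
print(cores)
```

In the example, the first two cliques share two of their three nodes (score 4/9), so they are merged into one core, while the third clique is kept as a core on its own.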
Table: Details of interaction datasets (columns: number of proteins, number of interactions).
The GO resource we used can be downloaded from http://www.geneontology.org/ with version 1.2028, dated 06/10/2011. The version of the annotation file of Saccharomyces cerevisiae is 1.1566 submitted on 06/18/2011.
We evaluate the experimental results with six evaluation metrics: precision (P), recall (R), F-measure (F), sensitivity (Sn), positive predictive value (PPV) and accuracy (Acc), which are described in . A predicted complex is matched with a benchmark complex if their neighborhood affinity (NA) score is above 0.2, the threshold used in most approaches.
As filter_thres increases, the precision generally rises, indicating that highly accurate complexes can be identified from high-quality interactions. Removing interaction pairs with low similarity therefore significantly improves the performance of complex identification, and the proposed GO semantic similarity measure is effective in estimating the quality of PPIs. The F-measure reaches its maximum when filter_thres is set to 0.6, an optimum that is also confirmed on other combinations of network and benchmark datasets. In addition, the number of predicted complexes decreases as filter_thres increases. This number is above 1,000 when filter_thres is less than 0.3, which seems unreasonable for a network with 3581 nodes. The reason is that the clique finding algorithm  generates cliques starting from every node in the network, and many of these cliques share a high proportion of nodes. It is therefore necessary to merge the many overlapping cliques.
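As an illustration of the matching-based metrics, the Python sketch below computes precision, recall and F-measure under the definitions commonly used in the complex-detection literature; whether they coincide exactly with those used here is an assumption. NA is taken to be |V_A ∩ V_B|² / (|V_A| · |V_B|), precision is the fraction of predicted complexes that match some benchmark complex, and recall is the fraction of benchmark complexes matched by some prediction. All names and the toy data are hypothetical.

```python
def na(a, b):
    """Assumed NA score between two protein sets."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))

def precision_recall_f(predicted, benchmark, thres=0.2):
    """Matching-based precision, recall and F-measure."""
    hit_pred = sum(any(na(p, b) > thres for b in benchmark) for p in predicted)
    hit_bench = sum(any(na(p, b) > thres for p in predicted) for b in benchmark)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_bench / len(benchmark) if benchmark else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical predicted and benchmark complexes.
predicted = [{"A", "B", "C"}, {"X", "Y"}]
benchmark = [{"A", "B", "C", "D"}, {"M", "N"}]
print(precision_recall_f(predicted, benchmark))
```

Here only the first prediction matches a benchmark complex (NA = 9/12), so precision, recall and F-measure are all 0.5.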
Comparison with other approaches
Sensitivity, PPV and accuracy evaluate the correspondence between the predictions and the benchmark at the micro level. Sensitivity represents the coverage of a complex by its best-matching cluster (the maximal fraction of proteins in the complex found in a common cluster), while PPV measures how well a given cluster predicts its best-matching complex . Accuracy is the geometric mean of sensitivity and PPV. By reaching a balanced level across these metrics, our method generates complexes that accurately match more real complexes.
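These three metrics can be sketched in Python from a contingency table T, where T[i][j] is the number of proteins shared by benchmark complex i and predicted cluster j. The aggregation below (clustering-wise sums over best matches) follows the definitions commonly used in the literature and is assumed, not confirmed, to be the one used here; the function name and toy data are hypothetical.

```python
from math import sqrt

def sn_ppv_acc(benchmark, predicted):
    """Clustering-wise sensitivity, PPV and their geometric mean."""
    # t[i][j]: proteins shared by benchmark complex i and predicted cluster j.
    t = [[len(c & p) for p in predicted] for c in benchmark]
    # Sensitivity: best-covered share of each complex, weighted by complex size.
    sn = sum(max(row) for row in t) / sum(len(c) for c in benchmark)
    # PPV: best-matching share of each cluster, weighted by its column total.
    col_total = [sum(row[j] for row in t) for j in range(len(predicted))]
    col_best = [max(row[j] for row in t) for j in range(len(predicted))]
    ppv = sum(col_best) / sum(col_total) if sum(col_total) else 0.0
    return sn, ppv, sqrt(sn * ppv)

# Hypothetical benchmark complexes and predicted clusters.
benchmark = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F"}]
print(sn_ppv_acc(benchmark, predicted))
```

For this toy input the table is [[3, 1], [0, 2]], giving Sn = 5/6 and PPV = 5/6, and therefore Acc = 5/6.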
Performance comparison of various approaches on Gavin-CYC2008
Performance comparison of various approaches on Krogan-CYC2008
Performance comparison of various approaches on DIP-CYC2008
Examples of predicted complexes
Table: Examples of predicted complexes (columns: GO biological processes, GO molecular functions, GO cellular components).
YGR095C YDL111C YGR158C
YCR035C YOL142W YHR069C
YOR001W YHR081W YDR280W
YNL232W YOL021C YGR195W
YPL243W YML105C YKL122C
YPR088C YDL092W YPL210C
YBR060C YPR162C YNL261W
YHR118C YML065W YLL004W
YHL025W YJL176C YNR023W
YOR290C YFL049W YPR034W
YBR289W YMR033W YPL129W
YLR071C YGR104C YOR174W
YER022W YOL135C YHR041C
YGL025C YDR443C YBR253W
YNL236W YHR058C YOL051W
YMR112C YNR010W YBR193C
YPR070W YPR168W YCR081W
YLR357W YFR037C YPR034W
YBR245C YFR013W YPL235W
YOR304W YDR190C YCR052W
YKR008W YGL133W YDR303C
YDR416W YAL032C YMR288W
YMR213W YHR165C YGR278W
YLR117C YDL209C YPL151C
YNL252C YML025C YDR116C
YNL284C YGR220C YCR046C
Computational approaches for protein complex detection are often affected by false-positive interactions in large-scale PPI data. In this paper, we identify protein complexes in PPI networks with a two-phase method. We first measure the semantic similarity of GO terms and proteins using the ontology structure, in order to evaluate the reliability of PPIs. After the unreliable interactions are removed, a core-attachment based clustering method is applied to the filtered network for complex identification. The main contributions of this paper are: 1) a graph-based GO semantic similarity measure for purifying the PPI network, and 2) a core-attachment detection algorithm that makes use of the RRW algorithm to detect complexes in the filtered network.
Compared with various other approaches, our method performs better in the overall evaluation. The graph-based similarity measure enhances complex identification, and removing unreliable interactions before clustering improves performance significantly. The strategy of expanding clusters with the RRW algorithm is also effective for identifying the attachment proteins of protein complexes. Future research can focus on the similarity measure for PPIs in the network: various measures can be applied to estimate the reliability of protein pairs and filter out false-positive interactions.
This work is partly supported by a grant from the Natural Science Foundation of China (No. 60973068 and 61070098), the National High Tech Research and Development Plan of China (No.2006AA01Z151) and the Fundamental Research Funds for the Central Universities (No. DUT10JS09).
This article has been published as part of Proteome Science Volume 10 Supplement 1, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/10/S1.
- Ito T, Chiba T, Ozawa R, Yoshida M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98(8):4569–4574. 10.1073/pnas.061034498
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180–183. 10.1038/415180a
- Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM: Protein interaction networks from yeast to human. Current Opinion in Structural Biology 2004, 14(3):292–299. 10.1016/j.sbi.2004.05.003
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4:2. 10.1186/1471-2105-4-2
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–1584. 10.1093/nar/30.7.1575
- Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 2006, 22(8):1021–1023. 10.1093/bioinformatics/btl039
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P, et al.: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399–403.
- King A, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20(17):3013–3020. 10.1093/bioinformatics/bth351
- Li X, Foo C, Ng S: Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. Comput Syst Bioinformatics Conf 2007, 6:157–168.
- The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34:322–326. 10.1093/nar/gkj439
- Macropol K, Can T, Singh AK: RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics 2009, 10:283. 10.1186/1471-2105-10-283
- Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics 2009, 25(15):1891–1897. 10.1093/bioinformatics/btp311
- Gavin A, Aloy P, Grandi P, Krause R, Boesche M, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631–636. 10.1038/nature04532
- Pekar V, Staab S: Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. Proceedings of the 19th International Conference on Computational Linguistics 2002, 1:1–7.
- Schlicker A, Domingues FS, Rahnenführer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006, 7:302. 10.1186/1471-2105-7-302
- Wu M, Li X, Kwoh CK, Ng S: A Core-Attachment based Method to Detect Protein Complexes in PPI Networks. BMC Bioinformatics 2009, 10:169. 10.1186/1471-2105-10-169
- Leung HC, Yiu SM, Xiang Q, Chin FY: Predicting Protein Complexes from PPI Data: A Core-Attachment Approach. Journal of Computational Biology 2009, 16(2):133–144. 10.1089/cmb.2008.01TT
- Tomita E, Tanaka A, Takahashi H: The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science 2006, 363(1):28–42. 10.1016/j.tcs.2006.06.015
- Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res 2009, 37(3):825–831. 10.1093/nar/gkn1005
- Aloy P, Bottcher B, Ceulemans H, et al.: Structure-based assembly of protein complexes in yeast. Science 2004, 303(5666):2026–2029. 10.1126/science.1092645
- Mewes HW, Amid C, Arnold R, et al.: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2004, 32:D41–D44. 10.1093/nar/gkh092
- Krogan N, Cagney G, Yu H, Zhong G, Guo X, et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637–643. 10.1038/nature04670
- Xenarios I, Salwinski L, Duan X, Higney P, Kim S, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 2002, 30:303–305. 10.1093/nar/30.1.303
- Dwight SS, Harris MA, Dolinski K, et al.: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Research 2002, 30(1):69–72. 10.1093/nar/30.1.69
- Friedel CC, Krumsiek J, Zimmer R: Bootstrapping the Interactome: Unsupervised Identification of Protein Complexes in Yeast. Journal of Computational Biology 2009, 16(8):971–987. 10.1089/cmb.2009.0023
- Li X, Wu M, Kwoh CK, Ng S: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics 2010, 11(Suppl 1):S3. 10.1186/1471-2164-11-S1-S3
- Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20(18):3710–3715. 10.1093/bioinformatics/bth456
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.