M-Finder: Uncovering functionally associated proteins from interactome data integrated with GO annotations
© Cho et al.; licensee BioMed Central Ltd. 2013
Published: 7 November 2013
Protein-protein interactions (PPIs) play a key role in understanding the mechanisms of cellular processes. The availability of interactome data has catalyzed the development of computational approaches to elucidate functional behaviors of proteins on a system level. Gene Ontology (GO) and its annotations are a significant resource for functional characterization of proteins. Because of wide coverage, GO data have often been adopted as a benchmark for protein function prediction on the genomic scale.
We propose a computational approach, called M-Finder, for functional association pattern mining. This method employs semantic analytics to integrate genome-wide PPIs with GO data. We also introduce an interactive web application tool that visualizes a functional association network linked to a protein specified by a user. The proposed approach comprises two major components. First, the PPIs that have been generated by high-throughput methods are weighted in terms of their functional consistency using GO and its annotations. We assess two advanced semantic similarity metrics which quantify the functional association level of each interacting protein pair. We demonstrate that these measures outperform other existing methods by evaluating their agreement with other biological features, such as sequence similarity, the presence of common Pfam domains, and core PPIs. Second, an information flow-based algorithm is employed to efficiently discover a set of proteins functionally associated with the query protein, together with their links. This algorithm reconstructs a functional association network of the query protein. The output network size can be flexibly determined by parameters.
M-Finder provides a useful framework to investigate functional association patterns with any protein. This software will also allow users to perform further systematic analysis of a set of proteins for any specific function. It is available online at http://bionet.ecs.baylor.edu/mfinder
PPI data have a central role in understanding the functional behavior of proteins. Recent high-throughput techniques [1, 2] have generated the interactome, the entire set of PPIs on the genomic scale. The accumulated PPI data sets of several model organisms are publicly available in a number of open databases. The availability of interactomes has introduced a new paradigm for the functional characterization of proteins [4, 5]. However, high-throughput experimental and computational methods also produce less reliable outcomes, which contain a large fraction of false-positive interactions. Therefore, curation of current PPI data sets by integration of other data sources is strongly needed [6, 7].
In the last decade, a wide range of computational algorithms have been introduced to predict protein complexes or functional modules from genome-wide PPI networks [8–12]. A functional module represents a set of proteins which participate in the same biological processes. Because of the unreliability of PPI data sets, integrative approaches have also been applied to uncover the functional organizations and structures hidden in PPI networks. As one of the integrative approaches, the functional association level (or functional consistency) between interacting proteins can be quantified by a semantic similarity measure, which models closeness in meaning between two or more ontological terms. Semantic measures may be extended to proteins by using the terms to which they are annotated. The use of GO and its annotations has been commonly suggested for computing the semantic similarity of each interacting protein pair. Higher semantic similarity between the two sets of GO terms annotating two interacting proteins indicates that the proteins are more closely associated with each other in terms of their functions. Although GO annotation data include some unreliable sources (e.g., results from high-throughput experimental and computational methods), they are often adopted as a benchmark for functional characterization of proteins because of their wide coverage on the genomic scale over various model organisms.
A previous study proposed a computational approach for functional association pattern mining. The proposed method employs a two-step strategy. First, up-to-date PPI data sets are extracted from the BioGRID database and weighted by two advanced semantic similarity metrics, called simICNP and simICND. In this article, we evaluate their performance by comparing them with previous methods. We investigate whether each interacting pair agrees with other biological features, such as sequence similarity, the presence of common Pfam domains, and core PPIs.
Second, the core of functional association pattern mining is an information flow-based algorithm that runs on the weighted genome-wide PPI network. When a protein is given by a user, this algorithm generates a group of proteins functionally associated with the protein and their functional links by random walk simulation. Because this algorithm is remarkably efficient, it is well suited to web-based tools. In this article, we introduce web application software, called M-Finder, that reconstructs a functional association network from a protein specified by a user. This interactive web-based tool takes a protein entered by a user in a query (using systematic names or gene symbols as protein identifiers) and visualizes a network generated by dynamic information propagation starting from the query protein. The visualized network represents a functional linkage pattern associated with the protein of interest. The output network size can be flexibly determined by the parameters that a user specifies. Moreover, M-Finder provides detailed ontological and experimental information on each interactor and each interaction, which can be obtained through the hyperlinks on the visualized networks. M-Finder will allow users to characterize functional mechanisms of proteins on the genomic scale from a systematic perspective.
Survey of semantic similarity methods
An ontology provides well-defined, structured and computable semantics of domain knowledge. Because of the need for consistent descriptions of genes and gene products across species, GO was launched as a collaborative effort to build complete ontologies in the biological domain. Semantic similarity is a function that measures closeness in meaning between ontological terms. Over the past few years, various methods to compute semantic similarity using GO and its annotation data have been proposed [20–22]. The semantic similarity scores have been applied to quantify functional similarity between proteins. According to the components of GO that they use, we can group the existing methods into four broad categories: edge-based methods (measuring path length between two terms), node-based methods (counting common ancestor terms between two terms), annotation-based methods (measuring information contents of two terms), and integrative methods.
Suppose we measure the semantic similarity between two GO terms t1 and t2 that annotate two proteins of interest, respectively. First, edge-based methods explore the paths between GO terms in the DAG (Directed Acyclic Graph) structure of GO. For instance, we can compute the shortest path length between t1 and t2. Since each ontology has a different scale, the shortest path length between two terms can be normalized by the ontology depth, i.e., the greatest length among the shortest paths from the root to leaf terms. Another example in this category is to measure the depth to the most specific common ancestor term (called SCA) of t1 and t2. A greater depth to the SCA indicates higher semantic similarity between t1 and t2. This measure can be normalized by the average depth to the individual GO terms, t1 and t2. However, it has been observed that these methods are not appropriate for assessing functional similarity of proteins because GO has inherently complex relationships among its terms and it cannot be guaranteed that all edges in GO represent the same degree of specificity.
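As a minimal sketch of the two edge-based ideas described above, the following Python code computes a depth-normalized shortest path length and the depth of the deepest common ancestor divided by the average term depth. The toy ontology, the term names, and all function names are hypothetical and serve only as illustration; published edge-based measures may normalize differently.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its parent terms (a DAG rooted at "root").
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
    "e": ["c"],
}

def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def depth(term):
    """Length of the shortest path from the root to the term."""
    if not PARENTS[term]:
        return 0
    return 1 + min(depth(parent) for parent in PARENTS[term])

def shortest_path_length(t1, t2):
    """Shortest path between two terms, ignoring edge direction (breadth-first search)."""
    neighbors = {t: set(ps) for t, ps in PARENTS.items()}
    for t, ps in PARENTS.items():
        for parent in ps:
            neighbors[parent].add(t)
    dist, queue = {t1: 0}, deque([t1])
    while queue:
        current = queue.popleft()
        if current == t2:
            return dist[current]
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    raise ValueError("terms are not connected")

def ontology_depth():
    """Greatest depth among the leaf terms."""
    has_child = {parent for ps in PARENTS.values() for parent in ps}
    return max(depth(t) for t in PARENTS if t not in has_child)

def normalized_path_length(t1, t2):
    """Shortest path length divided by the ontology depth (smaller means more similar)."""
    return shortest_path_length(t1, t2) / ontology_depth()

def depth_to_sca_similarity(t1, t2):
    """Depth of the deepest common ancestor divided by the average depth of t1 and t2."""
    sca_depth = max(depth(t) for t in ancestors(t1) & ancestors(t2))
    return sca_depth / ((depth(t1) + depth(t2)) / 2)

print(normalized_path_length("e", "d"), depth_to_sca_similarity("e", "d"))
```

In this toy DAG the terms "e" and "d" share only the shallow ancestor "a", so the depth-based score is low even though both terms are specific. The sketch does not handle the degenerate case where both terms are the root.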
where C(t1) is the set of all ancestor terms of t1.
Schlicker et al. proposed a method that combines Resnik's and Lin's methods, called simRel. If the SCA is defined as the term where two paths towards the root from t1 and t2 converge, multiple SCAs of t1 and t2 generally occur in a DAG structure since a GO term may have multiple parent terms. Couto et al. defined the set of all SCAs of pairwise paths towards the root from t1 and t2 as common disjunctive ancestors. They proposed two add-on semantic similarity methods: GraSM, which averages the information contents of the common disjunctive ancestor terms, and DiShIn, which is a slight modification of GraSM.
where C(t1) is the set of all ancestor terms of t1. IntelliGO integrates the edge-based method with information contents as weights. Jain and Bader proposed an integrative approach, called TCSS, which integrates a clustering technique with a semantic similarity measure. Clustering of GO terms yields a set of subgraphs of GO. The semantic similarity is weighted to account for whether the two terms fall in the same subgraph.
Table 1. Summary of semantic similarity methods in four categories.
Edge-based methods:
- Path length between two GO terms
- Depth to SCA divided by average depth to two GO terms
Node-based methods:
- The number of common ancestors of two GO terms
- Common ancestors divided by union of ancestor sets of two GO terms
Annotation-based methods:
- IC of SCA of two GO terms
- IC of SCA divided by average IC of two GO terms
- Sum of differences of ICs between SCA and two GO terms
- Average IC of all disjunctive common ancestors of two GO terms
Integrative methods:
- Combination of Resnik's and Lin's methods
- Combination of Resnik's and Jiang's methods
- Combination of common ancestor terms and their depth
- Combination of simUI and ICs of ancestor terms
- Combination of depth to two GO terms and ICs of ancestor terms
- Combination of Resnik's method and a clustering technique
- Combination of Resnik's method and path length between two GO terms
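To make the node-based and annotation-based descriptions in the summary more concrete, the sketch below computes the ratio of common ancestors to the union of ancestor sets, the information content (IC) of the most informative common ancestor, and that IC divided by the average IC of the two terms. It assumes the usual definition IC(t) = -log p(t), where p(t) is the fraction of annotated proteins at or below term t. The ancestor sets, annotation counts, and function names are hypothetical.

```python
import math

# Hypothetical toy data: the ancestor set of each term (including the term itself).
ANCESTORS = {
    "root": {"root"},
    "a": {"a", "root"},
    "c": {"c", "a", "root"},
    "d": {"d", "a", "root"},
}
# Assumed number of proteins annotated to each term or to any of its descendants.
ANNOTATION_COUNT = {"root": 100, "a": 40, "c": 10, "d": 5}

def information_content(term):
    """IC(t) = -log p(t), with p(t) estimated from the annotation counts."""
    return -math.log(ANNOTATION_COUNT[term] / ANNOTATION_COUNT["root"])

def common_ancestor_ratio(t1, t2):
    """Node-based: number of common ancestors divided by the size of the union."""
    a1, a2 = ANCESTORS[t1], ANCESTORS[t2]
    return len(a1 & a2) / len(a1 | a2)

def ic_of_sca(t1, t2):
    """Annotation-based: IC of the most informative common ancestor."""
    return max(information_content(t) for t in ANCESTORS[t1] & ANCESTORS[t2])

def ic_of_sca_over_average_ic(t1, t2):
    """Annotation-based: IC of the SCA divided by the average IC of the two terms."""
    average = (information_content(t1) + information_content(t2)) / 2
    # Returns 0.0 when both terms carry no information (e.g., both are the root).
    return ic_of_sca(t1, t2) / average if average > 0 else 0.0

print(common_ancestor_ratio("c", "d"))
print(ic_of_sca("c", "d"), ic_of_sca_over_average_ic("c", "d"))
```

Because the ratio form is bounded by 1 for any pair of terms, it is easier to compare across ontologies than the raw IC of the common ancestor, which grows without bound as terms become rarer.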
Previous studies have also observed that the BMA approach is the best for estimating functional similarity between two proteins which perform multiple functions.
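The best-match average (BMA) idea can be sketched as follows. This is one common formulation, given here as an assumption rather than as the article's exact formula: each term of one protein is matched with its most similar term in the other protein, and the two directional averages are then averaged. The term-level similarity table and all names are hypothetical.

```python
def best_match_average(terms1, terms2, term_similarity):
    """Average of the best term-to-term matches, taken in both directions."""
    if not terms1 or not terms2:
        return 0.0  # a protein without annotations cannot be compared
    best_for_1 = [max(term_similarity(a, b) for b in terms2) for a in terms1]
    best_for_2 = [max(term_similarity(a, b) for a in terms1) for b in terms2]
    mean_1 = sum(best_for_1) / len(best_for_1)
    mean_2 = sum(best_for_2) / len(best_for_2)
    return (mean_1 + mean_2) / 2

# Hypothetical term-level similarity scores between GO terms of two proteins.
TOY_SIMILARITY = {("x", "z"): 0.8, ("y", "z"): 0.2}

def toy_term_similarity(a, b):
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))

print(best_match_average(["x", "y"], ["z"], toy_term_similarity))
```

Unlike taking the maximum over all term pairs, this form penalizes a protein whose additional functions have no counterpart in the other protein, which is why it suits proteins with multiple functions.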
Improvement of semantic similarity
where t0 is the SCA of t1 and t2, i.e., the term with the greatest information content among their common ancestor terms. The normalization factor of simICNP is the path length between t1 and t2 in the ontology, whereas that of simICND is the difference of information contents of t1 and t2. simICNP works better when the ontology has precise information about the relationships between specific terms. In contrast, simICND performs better when specific terms in the ontology have a sufficient number of annotations. Therefore, simICND and other annotation-based semantic similarities, such as Resnik's method and Lin's method, measure functional consistency between two proteins with high accuracy for well-studied model organisms. However, their weakness is low accuracy for rarely studied organisms.
In this study, we comprehensively evaluate simICNP, simICND and other competing semantic similarity methods from a biological perspective, as shown in the next section. For both simICNP and simICND, we use the BMA approach in Formula 1 to obtain functional similarity scores for all interacting protein pairs.
Discovering functional associations
To discover the functional associations (or functional linkage) of a given protein, we apply a previously presented information flow algorithm to the weighted PPI network. The algorithm is based on the path strength model, in which the strength of a path is defined as the product of the edge weights divided by the node degrees on the path. Under this model, a path, i.e., a series of directly connected proteins, has high strength when its edge weights are high and its node degrees are low. Starting from a protein that a user specifies, information flow traverses the genome-wide PPI network through all links and repeatedly updates the functional influence score of each protein using the path strength model. The major strength of this approach is its high efficiency in scoring the functional influence of the user-specified protein on any other protein in a PPI network with complex connectivity. The recursive random walk computation in this algorithm runs far faster than enumerating all possible paths from the user-specified protein to other proteins.
This approach allows a parameter to be set that terminates the information flow on a path. As information flows continuously through the links, the algorithm generates monotonically decreasing functional influence scores according to the path strength model. When the score falls below a user-specified threshold, the flow stops on that path. The algorithm terminates when no link carries any flow. The lower the threshold, the longer the algorithm runs.
This approach requires an additional parameter to return a functional association network, which is a subgraph of the genome-wide PPI network. When the information flow terminates, the algorithm collects the proteins whose functional influence scores are greater than a user-specified threshold, together with their links. This threshold thus determines the size of the generated network. As the threshold decreases, the algorithm returns a larger functional association network.
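The path strength model can be illustrated with a short sketch. The text does not specify which node degrees enter the product, so this example assumes that each step from protein u to protein v contributes the edge weight divided by the degree of u. The toy network and function names are hypothetical.

```python
# Hypothetical weighted PPI network (undirected); weights are assumed to lie in [0, 1].
EDGES = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "D"): 0.3,
    ("D", "C"): 0.5,
    ("B", "D"): 0.4,
}

def build_adjacency(edges):
    """Turn an undirected edge list into a dict: protein -> {neighbor: weight}."""
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, {})[v] = weight
        adjacency.setdefault(v, {})[u] = weight
    return adjacency

def path_strength(adjacency, path):
    """Product over the steps (u, v) of weight(u, v) / degree(u) (an assumed convention)."""
    strength = 1.0
    for u, v in zip(path, path[1:]):
        strength *= adjacency[u][v] / len(adjacency[u])
    return strength

adjacency = build_adjacency(EDGES)
print(path_strength(adjacency, ["A", "B", "C"]))
print(path_strength(adjacency, ["A", "D", "C"]))
```

The path through B is stronger than the path through D because its edges carry higher weights, which matches the intuition that strong paths combine high edge weights with low node degrees.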
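The two thresholds described above can be sketched as a simple breadth-first propagation. This is an illustrative approximation under stated assumptions, not the published algorithm: flow passed from u to v is taken as flow × weight(u, v) / degree(u), each protein accumulates the flow it receives, and a hypothetical `max_hops` safeguard is added so that the loop always terminates. The toy network and all names are invented for the example.

```python
from collections import deque

# Hypothetical weighted PPI network (undirected); weights are assumed to lie in [0, 1].
EDGES = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "D"): 0.3,
    ("D", "C"): 0.5,
    ("B", "D"): 0.4,
}

def build_adjacency(edges):
    """Turn an undirected edge list into a dict: protein -> {neighbor: weight}."""
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, {})[v] = weight
        adjacency.setdefault(v, {})[u] = weight
    return adjacency

def information_flow(adjacency, query, stop_threshold=0.01, initial=1.0, max_hops=100):
    """Propagate flow from the query protein and accumulate a score per protein."""
    scores = {query: initial}
    queue = deque([(query, initial, 0)])
    while queue:
        node, flow, hops = queue.popleft()
        if hops >= max_hops:  # safeguard (assumption), not part of the description
            continue
        degree = len(adjacency[node])
        for neighbor, weight in adjacency[node].items():
            passed = flow * weight / degree
            if passed < stop_threshold:
                continue  # the flow stops on this path
            scores[neighbor] = scores.get(neighbor, 0.0) + passed
            queue.append((neighbor, passed, hops + 1))
    return scores

def association_network(adjacency, scores, min_association):
    """Keep proteins scoring above the threshold and the links among them."""
    selected = {p for p, s in scores.items() if s > min_association}
    links = {tuple(sorted((u, v)))
             for u in selected for v in adjacency[u] if v in selected}
    return selected, links

adjacency = build_adjacency(EDGES)
scores = information_flow(adjacency, "A")
print(association_network(adjacency, scores, 0.2))
```

Lowering `min_association` from 0.2 to 0.1 in this toy example adds protein C to the returned subgraph, which mirrors the statement that a smaller threshold yields a larger network.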
Results and discussion
Assessment of semantic similarity
Following the underlying idea that functionally related proteins present common physical attributes, semantic similarity measures are usually assessed by evaluating their agreement with other biological features, such as sequence similarity, the presence of common domains, and protein-protein interactions. We employed CESSM to compare the proposed semantic similarity measures, simICND and simICNP, with other state-of-the-art methods with respect to their performance in scoring functional similarity between proteins.
CESSM is a ready-to-use online tool that evaluates the relationship between semantic measures and other similarities based on sequence, Pfam family, and EC (Enzyme Commission) classification on a predefined set of 13,430 protein pairs of S. cerevisiae. The selected protein pairs are annotated not only to GO terms but also in the Pfam and KEGG databases. For each feature, a quantitative similarity score is calculated for each protein pair. The Pearson correlation is then used to evaluate the agreement between semantic similarity and the other features on the whole data set. The higher the correlation, the better the tested measure. It has been pointed out that, in general, the relationship between sequence and semantic similarity is not linear, and therefore Pearson correlation might not be the best measure to assess their agreement. Indeed, in the comparison with sequence similarity, CESSM also considers resolution as a quality measure. Intuitively, the resolution measures the intensity with which variations in sequence similarity affect semantic similarity. A measure with a higher resolution is likely to yield a greater variation, in terms of semantic similarity, between protein pairs with low and high sequence similarity.
To score the semantic similarities of the selected protein pairs, we used two ontologies in GO, biological process (BP) and molecular function (MF), separately. GO annotation data have been collected from published results of various high-throughput approaches, including both experimental and computational analyses. GO provides evidence codes to indicate the types of methods that created each annotation. Annotations with all evidence codes are manually curated, with the exception of Inferred from Electronic Annotation (IEA). We thus tested two different sets of semantic similarity scores, measured with and without IEA annotations, respectively.
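As a small illustration of the agreement measure, the sketch below computes the Pearson correlation between two lists of per-pair scores. The score lists and the function name are hypothetical; CESSM itself is an online tool and is not called here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (spread_x * spread_y)

# Hypothetical scores for five protein pairs.
semantic_similarity = [0.10, 0.35, 0.40, 0.70, 0.95]
sequence_similarity = [0.05, 0.20, 0.50, 0.60, 0.90]
print(pearson(semantic_similarity, sequence_similarity))
```

Because Pearson correlation captures only linear agreement, a measure that ranks pairs correctly but non-linearly can still score below 1, which is the motivation for also reporting resolution.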
Table 2. CESSM results of semantic similarities in BP ontology with IEA annotations.
Table 3. CESSM results of semantic similarities in BP ontology without IEA annotations.
Table 4. CESSM results of semantic similarities in MF ontology with IEA annotations.
Table 5. CESSM results of semantic similarities in MF ontology without IEA annotations.
The results of semantic similarities measured in BP ontology with and without IEA annotations are shown in Tables 2 and 3. The top two semantic similarity methods for each reference feature (sequence, Pfam, Enzyme Commission classification, or resolution) are shown in bold. When compared to sequence similarity, Resnik's method and simGIC have slightly higher correlations than simICNP and simICND. However, for the other features, the two proposed methods outperform the others. When we compare Tables 2 and 3, the similarities measured including IEA annotations have higher correlations than those without IEA annotations over all semantic similarity methods and features.
The results of semantic similarities measured in MF ontology with and without IEA annotations are shown in Tables 4 and 5. Among previous methods, simGIC has relatively high correlations. However, when all features are considered, simICNP and simICND clearly show better results than simGIC. Similar to the results in BP ontology, the similarities measured including IEA annotations have higher correlations than those without IEA annotations over almost all semantic similarity methods and features. When we compare the results between the BP and MF ontologies, the semantic similarities in BP ontology have higher correlations with sequence similarities, whereas the semantic similarities in MF ontology have higher correlations with Pfam domain and Enzyme Commission class similarities. Overall, the two new semantic similarity measures, which merge well-performing previous methods, yielded appreciable improvements in the CESSM test.
It has been verified that semantic similarity is also a good predictor of PPIs. The rationale behind this is that interacting protein pairs are likely to be involved in similar biological processes or molecular functions, and therefore should present higher values of semantic similarity than non-interacting protein pairs. Thus, given a positive set P of interacting protein pairs and a negative set N of non-interacting protein pairs, semantic similarity measures can be compared in terms of their ability to divide interacting and non-interacting protein pairs.
We validated simICND and simICNP on two PPI data sets of S. cerevisiae with different characteristics. First, a small, high-quality positive set P1 of 11,936 interactions was extracted from Hint, a database of manually reviewed PPIs. For the second data set, the larger and more complete I2D network was used as a positive set P2. I2D is a collection of interactions derived from several databases, and currently counts 147k interactions. The negative sets N1 and N2 were built by randomly selecting |P1| and |P2| protein pairs, respectively, that are not present in the iRefIndex data set. iRefIndex is an index of 303k known, experimental or predicted PPIs that appear in a number of primary interaction databases. If a pair of proteins is not listed in iRefIndex, they are unlikely to interact with each other. Thus, selecting the negative set from pairs not present in iRefIndex should guarantee a low rate of false negatives. Given a cut-off threshold k, a linear separator predicts as interacting (I) all the protein pairs in P and N with semantic similarity scores above k, and labels all the other protein pairs as non-interacting (nI).
Two significant indicators for comparing the performance of different semantic similarity measures are sensitivity (|P ∩ I|/|P|) and specificity (|N ∩ nI|/|N|). The former, also called the true-positive rate, is the fraction of protein pairs in P whose scores are above the threshold. The latter, also called the true-negative rate, is the fraction of protein pairs in N whose scores are not above the threshold. The false-positive rate is then calculated as 1 - specificity. The true-positive and false-positive rates at different cut-off thresholds are collected into a receiver operating characteristic (ROC) curve, which is frequently used to evaluate prediction performance over a broad range of cut-off thresholds.
The results of semantic similarities measured in MF ontology with and without IEA annotations are shown in Figure 2(a) and 2(b). In Figure 2(a), simICND, simICNP and G-SESAME showed the best performance in predicting PPIs. G-SESAME works best when the false-positive rate is greater than 0.4, whereas simICND and simICNP perform better than G-SESAME when the false-positive rate is less than 0.1. As in the test with BP ontology, the similarities measured including IEA annotations perform better than those without IEA annotations over all semantic similarity methods. In Figure 2(b), most methods resulted in similar plots of increasing true-positive rates. However, simICND and simICNP perform slightly better when the false-positive rate is less than 0.1. Overall, the two new semantic similarity measures perform best in predicting PPIs with strict threshold values.
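The definitions above translate directly into code. In the following sketch, pairs with scores above the cut-off k are predicted as interacting and all others as non-interacting; the score lists, thresholds, and function names are hypothetical.

```python
def sensitivity_specificity(positive_scores, negative_scores, k):
    """Sensitivity = |P ∩ I| / |P| and specificity = |N ∩ nI| / |N| at cut-off k."""
    true_positives = sum(1 for s in positive_scores if s > k)
    true_negatives = sum(1 for s in negative_scores if s <= k)
    return (true_positives / len(positive_scores),
            true_negatives / len(negative_scores))

def roc_points(positive_scores, negative_scores, thresholds):
    """(false-positive rate, true-positive rate) pairs, sorted for plotting."""
    points = []
    for k in thresholds:
        sensitivity, specificity = sensitivity_specificity(
            positive_scores, negative_scores, k)
        points.append((1 - specificity, sensitivity))
    return sorted(points)

# Hypothetical semantic similarity scores of interacting and non-interacting pairs.
P_SCORES = [0.9, 0.8, 0.6, 0.3]
N_SCORES = [0.7, 0.4, 0.2, 0.1]
print(sensitivity_specificity(P_SCORES, N_SCORES, 0.5))
print(roc_points(P_SCORES, N_SCORES, [0.0, 0.5, 1.0]))
```

A measure that separates the two sets well produces points close to the upper-left corner of the ROC plot, i.e., a high true-positive rate at a low false-positive rate.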
Performance evaluation of functional association mining
The performance of our functional association mining approach can be validated by comparing its output to functional modules. The genome-wide PPI data of several model organisms are publicly available from many open databases such as BioGRID, IntAct, MINT and STRING. In this performance test, we used the most recent version of the genome-wide PPI data set of S. cerevisiae from BioGRID, which includes 4,998 distinct proteins and 161,866 interactions. The first step is to weight the PPIs. We used simICND since it had the best overall performance in the experiments shown in the previous section. Using a linear function, we transformed all simICND scores into the range between 0 and 1. Next, after randomly selecting 1,000 proteins as queries, we ran our algorithm with each selected protein. In the information flow simulation, we assigned an initial score of 1 to the query protein and used 0.01 as the threshold to stop the flow on each linked path. As described earlier, an additional parameter, which we call the minimum association threshold, is needed to select the proteins and links of a functional association network. We varied this threshold and examined how the accuracy of our approach changes with it.
Finally, the 1,000 resultant functional association sub-networks were compared to functional modules. We used FunCat data from MIPS as the reference functional modules. Since this data set has been manually created, we assumed that it has the highest precision (also known as the positive predictive value). In other words, we assume that this data set rarely contains false-positive proteins - the proteins that are included in the same functional module but do not perform the same functions. However, because this data set is not comprehensive and has not been updated recently, it is not guaranteed to have the highest recall (also known as sensitivity). In other words, this data set might have many false-negative proteins - the proteins that perform the same functions but are not included in the same functional module. In this performance test, we therefore measured only precision, using the FunCat data set as the gold standard.
The functional modules in FunCat are organized in a hierarchical tree structure. The first-level (i.e., top-level) modules give the most general descriptions of cellular functions, whereas the bottom-level modules give the most specific descriptions. We extracted the modules from the top three levels of the tree structure. The average precision of the results compared with the functional modules on the three different levels is also shown in Figure 3. Although the three plots have a very similar increasing pattern, the highest precision is achieved when the output subgraphs are compared to the largest modules with the most general functional descriptions, because there is a higher chance that the output subgraphs include the proteins in those modules. Overall, a minimum association threshold between 15 and 20 is recommended to obtain high precision with relatively large sub-networks as output.
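The text states only that a linear function maps the scores into the range between 0 and 1. One plausible choice, assumed here purely for illustration, is min-max rescaling; the edge scores and the function name are hypothetical.

```python
def rescale_linear(scores):
    """Map raw scores linearly onto [0, 1] (min-max rescaling, an assumed choice)."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {pair: 0.0 for pair in scores}  # all scores equal: no spread to map
    return {pair: (value - lowest) / (highest - lowest)
            for pair, value in scores.items()}

# Hypothetical raw semantic similarity scores of three interactions.
RAW_SCORES = {("A", "B"): 2.0, ("B", "C"): 4.0, ("A", "C"): 3.0}
print(rescale_linear(RAW_SCORES))
```

Bounding the weights by 1 ensures that each step of the information flow can only reduce the propagated score, which is consistent with the monotonically decreasing scores described for the algorithm.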
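A precision measurement of this kind can be sketched as follows. The text does not state how an output sub-network is paired with a reference module, so this example assumes that the best-matching module is used; the module contents and function names are hypothetical.

```python
def precision(predicted, module):
    """Fraction of predicted proteins that belong to the reference module."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(module)) / len(predicted)

def best_precision(predicted, modules):
    """Precision against the best-matching reference module (an assumed pairing rule)."""
    return max(precision(predicted, members) for members in modules.values())

# Hypothetical reference modules and one output sub-network.
MODULES = {"m1": {"A", "B", "C", "X"}, "m2": {"D", "Y"}}
OUTPUT_PROTEINS = {"A", "B", "C", "D"}
print(best_precision(OUTPUT_PROTEINS, MODULES))
```

Precision ignores module members that the output misses, so it remains meaningful even when the reference modules are incomplete, which is the reason given above for not reporting recall.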
Functional association mining software
We introduce the interactive web application software, called M-Finder, to analyze functional associations (functional linkage) from any protein of interest. This web-based tool is designed as an interactive system which enables a user to enter any protein in a query, choose a semantic similarity method, and specify the minimum association threshold as a parameter. Then the information flow algorithm, embedded in this tool, runs with the user inputs on the up-to-date genome-scale PPI network with edge weights, and the generated functional sub-network associated with the query protein is visualized.
The PPI data set is regularly updated with the most recent version from BioGRID. The PPI weights are pre-computed by the semantic similarity methods and stored in our database. Since this large-scale PPI data set is likely to contain a large number of putative false-positive interactions, we filter out the PPIs whose semantic similarity score is less than 0.1. This tool currently works for S. cerevisiae only, but it will be extended to C. elegans, D. melanogaster and H. sapiens in the near future.
Since the information flow algorithm has a very quick response time (usually less than 10 seconds on our server) even on a large-scale network with complex connectivity, it is suitable for this interactive web-based tool. The threshold to halt the information flow is hard-coded as 0.01 in this tool because we observed that the result is not sensitive to this threshold. However, the threshold to select the proteins and their links for the final functional association network is a parameter that users can enter. The default value of this threshold is 20, but it should be decreased if a larger functional association network is needed.
M-Finder has further features. Users can download the output functional association networks as image files. Users can also look up additional information on the proteins and interactions in the output functional association network through hyperlinks. For example, ontological information related to the visualized proteins is provided.
This article presented a novel computational method to analyze functional association patterns related to a user-specified query protein. This approach integrates interactome data and GO annotations, and the information flow algorithm reconstructs a functional association network linked to the query protein, which is a small subgraph of the genome-wide PPI network. As discussed in the Introduction, a variety of graph clustering algorithms have been applied to detect functional modules from PPI networks. These graph clustering algorithms mostly search for densely connected subgraphs, assuming that proteins interact to perform a cellular function. However, listing all the clusters, i.e., the sets of proteins, on the genomic scale is not meaningful for genetic studies that aim to characterize a specific gene or a specific function. The proposed approach is thus unique in that it investigates the patterns of potential functional linkage associated with a specific protein of interest. The introduced web application software for analyzing and visualizing functional association networks is geared specifically to the need for systematic and quantitative results in genetic studies.
This study makes two significant contributions to current bioinformatics. First, biological data integration is increasingly in demand as an early stage of current data-intensive bioinformatics research. Automated high-throughput technologies have enabled rapid generation of large-scale data. However, as a downside, they decrease the reliability of the data sets. It has been observed that the interactome data currently available in open databases include a large number of false positives, i.e., spurious interactions which do not occur within a living cell. Although high-throughput methods have produced interaction data on the entire genome scale, it is expected that there still exists an extremely large number of false negatives across several model organisms, i.e., actual interactions that have not been determined yet. The reliability of interaction data can be assessed by inspecting other resources which enable us to judge the feasibility of functional association between genes, such as gene expression profiles. In this study, we suggested the integration of ontological data for filtering the interactome. We comprehensively evaluated two recently proposed integrative methods of semantic similarity, simICNP and simICND. The CESSM test and PPI test results demonstrated that the proposed approaches outperform the previous methods in terms of measuring the functional closeness of two proteins. Our ontological data integration model would provide an effective framework for curation of genome, transcriptome, proteome and interactome data.
Second, efficiency and scalability are key issues in large-scale, complex interactome data mining. A single protein influences multiple phenotypes in different environmental conditions, which is known as the pleiotropic effect. When the conditions are not separated, interaction networks typically have complex connectivity. Moreover, the scale of the interactome data increases remarkably for evolutionarily higher-level organisms. The proposed approach is based on a data-mining technique that can be implemented efficiently on large-scale networks with complex connectivity. Our heuristic model enables us to search for functional associations very efficiently by simulating random walks. This efficient and scalable approach could be generalized to any integrative analysis of complex systems. It is also well suited to being embedded into a web application tool such as the one introduced in this article. To further improve M-Finder in terms of efficiency and accuracy, we can explore the inherent topological properties of the genome-wide PPI networks. It might be feasible to predict and suggest the best parameter value for a specific species by analyzing these properties.
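The filtering step can be sketched in a few lines. The edge weights and the function name are hypothetical; only the cut-off of 0.1 comes from the text.

```python
def filter_interactions(weighted_edges, min_score=0.1):
    """Keep only the PPIs whose semantic similarity score is at least min_score."""
    return {pair: weight for pair, weight in weighted_edges.items()
            if weight >= min_score}

# Hypothetical pre-computed semantic similarity weights of four interactions.
WEIGHTED_EDGES = {
    ("A", "B"): 0.90,
    ("A", "C"): 0.05,
    ("B", "C"): 0.10,
    ("C", "D"): 0.00,
}
print(filter_interactions(WEIGHTED_EDGES))
```

Interactions scoring exactly 0.1 are retained here because the text removes only scores less than 0.1.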
This work was supported in part by funds from the Baylor University Research Committee and the Vice Provost for Research.
The publication costs for this article were funded by the corresponding author.
This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.