Fully automated protein complex prediction based on topological similarity and community structure

To understand the function of protein complexes and their association with biological processes, a lot of studies have been done towards analyzing the protein-protein interaction (PPI) networks. However, the advancement in high-throughput technology has resulted in a humongous amount of data for analysis. Moreover, high level of noise, sparseness, and skewness in degree distribution of PPI networks limits the performance of many clustering algorithms and further analysis of their interactions. In addressing and solving these problems we present a novel random walk based algorithm that converts the incomplete and binary PPI network into a protein-protein topological similarity matrix (PP-TS matrix). We believe that if two proteins share some high-order topological similarities they are likely to be interacting with each other. Using the obtained PP-TS matrix, we constructed and used weighted networks to further study and analyze the interaction among proteins. Specifically, we applied a fully automated community structure finding algorithm (Auto-HQcut) on the obtained weighted network to cluster protein complexes. We then analyzed the protein complexes for significance in biological processes. To help visualize and analyze these protein complexes we also developed an interface that displays the resulting complexes as well as the characteristics associated with each complex. Applying our approach to a yeast protein-protein interaction network, we found that the predicted protein-protein interaction pairs with high topological similarities have more significant biological relevance than the original protein-protein interactions pairs. When we compared our PPI network reconstruction algorithm with other existing algorithms using gene ontology and gene co-expression, our algorithm produced the highest similarity scores. Also, our predicted protein complexes showed higher accuracy measure compared to the other protein complex predictions.


Introduction
Protein-protein interaction (PPI) is the core to many fundamental biological processes. New high-throughput techniques, such as yeast two-hybrid and tandem affinity purification [1], have vastly increased the size of the protein-protein interaction data. With this large amount of protein-protein interaction (PPI) data which is usually modeled by PPI networks, the cell mechanistic can be understood at system level.
The growing PPI database helps both biological and computational scientists to predict gene functions, functional pathways, protein complexes and improve the diagnosis and treatment of diseases [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]. However, the growing of PPI network poses multiple challenges at the same time, such as PPI networks often have a high false positive rate and an even higher false negative rate [17]. PPI networks are also known to have skewed degree distribution, meaning that they have more than expected quantity of hub genes [18]. Additionally, the PPI networks are typically binary (sometimes with limited discrete value) and sparse, partially due to the high false negative rate, which places a hurdle for protein complex prediction.
In biological networks, community structures (protein complexes) normally have key biological significance. Identifying and analyzing such communities leads to the understanding of the nature of the system as well as potential novel relationships between communities [19]. One approach to identifying communities is using modularity-based method which aims at optimizing modularity (Q) and shown to be effective in measuring the overall quality of community structures [20]. Qcut [19] is an efficient heuristic algorithm that combines spectral graph partitioning and local search to optimize Q. It is able to find stronger community structures in large and relatively dense networks. For larger networks, the modularity function may face resolution limit problem since communities with relatively high intercommunity connectivity may be merged into a single community [21]. HQcut [19] solves this resolution limit problem by applying Qcut recursively to the communities that have already been identified.
Recently, we proposed a novel idea to predict proteinprotein interactions based on topological similarity between nodes in a given PPI network [18,22]. Basically, we consider two nodes to be similar if they have similar distances to all other nodes in the network (instead of only their direct neighbors), measured by a novel random walk procedure. While the results suggest that the predicted edges likely represent true physical protein-protein interactions, the algorithm relies on an arbitrary similarity cutoff to distinguish between interacting vs non-interacting protein pairs. In particularly, given the sparsity of the known PPI networks, we believe the coverage of PPIs can be significantly improved if an optimal cutoff can be selected.
In this work, we extend our work to introduce a novel approach for determining an optimal cutoff to predict protein-protein interactions from the weighted topological similarity matrix. Our idea is that a good cutoff should result in a network that is highly modular. Based on this idea, we also present a parameter-free, recursive algorithm, Auto-HQcut, to identify protein complexes from the PPI network. We also developed a graphical user interface to enable visualization of the weighted network and the predicted protein complexes.
To evaluate our approach, we applied the network reconstruction algorithm to a yeast PPI network and examine the biological relevance of the resulting PPI network. The resulting PPI network showed interactions with more functional relevance in comparison to the original PPI network. Comparison with existing methods showed that the network reconstructed by our method has the highest overall quality. Furthermore, applying Auto-HQcut clustering algorithm, we found that the reconstructed network had significantly improved prediction accuracy of protein complexes.

Results
For evaluation, we applied our algorithm to a yeast core PPI network obtained from [23], which covers 2708 genes with 7123 edges. By performing a modified random walk on this network and calculating similarities between every pair of nodes based on their topology equivalence, we derived a modified Protein-Protein Topological Similarity matrix (called PP-TS matrix), which covers all the topological similarities between any pair of genes in the network. To evaluate the functional relevance of the newly predicted pairwise similarities, we resort to two types of sources, gene ontology and gene expression.
For clustering, we used the methods based on the optimization of a modularity function (Q). We developed an approach, Auto-HQcut, to automatically determine the best parameter for the algorithm HQcut. We compared the clustering accuracy of each resulting protein complex to the known complexes. The results showed that our approach could be used to deduce the significant biological processes from any PPI network.

Biological functional relevance
The performance of the algorithm can be tested by evaluating the pairwise genes with top similarity scores in the PP-TS matrix for functional relevance. It is expected for the gene pairs with high PP-TS scores to be functionally more relevant. To determine the functional relevance between the gene pairs, we used the functional annotations in gene ontology (GO) and gene expression (GE) patterns across diverse conditions.
To represent the functional relevance of the top gene pairs in the PP-TS matrix, we calculated the average gene ontology similarity (called Average GO score) and gene co-expression (called Average GE score) of the gene pairs with the top PP-TS scores, as shown in Figure 1.
For GO, we used the gene pairs with top PP-TS score to obtain the same average GO score as the original PPI network to identify more associated gene pairs (26170 compared to 7123) which were not available in the original network. On the other hand, if we keep the same number of edges ("Original PPI edges" in Figure 1) as in original PPI network, the average GO similarity score is improved from 0.5512 to 0.6956. Similar to the GO, for GE, keeping the same average GE score, we get more associated gene pairs (15480 compared to 7123). Keeping the same number of edges as in original PPI network, the average GE similarity score is improved from 0.1865 to 0.2295.
With these two independent sources of evidence, the results show that the gene pairs with higher topological similarity scores have higher biological functional relevance.
The results also show a consistent pattern between gene ontology similarity and gene co-expression.
We compared our algorithm with four existing methods, namely Euclidean commute time (ECT) [24], random walk with restart (RWR) [25], multiple dimensional scaling (MDS) [26], and global geometric affinity (GGA) [27]. The ECT and RWR methods are well-known in data mining and network analysis communities, while the MDS and GGA methods were recently proposed to improve the quality of PPI networks. All four algorithms calculate some topology-based similarity scores for pairs of nodes. Figure 2 shows the average GO similarity score of the top gene pairs in the PP-TS matrix. As shown in the figure, the average score generated by our method has the highest GO similarity.
The average GE similarity scores are shown in Figure 3. Among all the 5 algorithms, both RWS and ECT have the best performance in gene co-expression score, however ECT has much lower GO similarity score compared to RWS as shown in Figure 2.

Protein complex accuracy
As the PP-TS matrix is a weighted and completely connected network, thresholds values are used to convert it to a sparse network. To this end, we chose the cutoff value such that the average GO value of the interacting pairs in the modified network is the same as that of the original PPI network. This resulted in 26170 edges, as mentioned above. We then applied two network clustering algorithms, Qcut and HQcut, to the original and RWS modified networks, and compared the predicted complexes with the MIPS known protein complexes (see Methods). As shown in Figure 4, the prediction accuracy is significantly improved for both Qcut and HQcut on the RWS modified network compared to the original network, demonstrating that the network quality improvement by RWS is general. On the other hand, HQcut always achieves better accuracy than Qcut, due to HQcut's strategy in addressing the resolution limit problem.
Again, we compared our algorithm with the four existing methods, ECT, RWR, MDS and GGA. For a fair comparison, we choose a different cutoff for each algorithm so that the predicted PPI networks from different algorithms all have the same number of edges as the RWS modified   PPI network. These networks are then subject to protein complex prediction using the HQcut algorithm. As shown in Figure 5, only RWS and RWR increased the complex prediction accuracy, and RWS is slightly better than RWR.
In order to further improve the protein complex prediction accuracy of HQcut, we reconstructed a series of weighted PPI networks with different cutoffs, and run HQcut on each of the network to identify potential protein complexes. We also measured the modularity differences between the selected networks and random generated networks. As shown in Figure 6, the modularity difference is well correlated with the protein complex prediction accuracy. Using the automatically determined PP-TS similarity cutoff (0.65) at the largest modularity difference value (0.49), the corresponding complex prediction accuracy is close to the optimal prediction accuracy. As shown in Figure 6, RWS modified PP-TS matrix with the fully automated parameter determination HQcut (called Auto-HQcut) has significantly improved the prediction accuracy compared to other algorithms, demonstrating that both the RWS and Auto-HQcut can help improve complex prediction accuracy. Figure 7(a) and 7(b) show the accuracy distribution for the original PPI network and the RWS modified network with automatic parameter determination HQcut. Among the 170 known complexes, 77 clusters had increased accuracy, while 65 of them kept the same accuracy, and 28 clusters had decreased accuracy. As shown in Figure 7, the number of protein complexes with near perfect prediction is much improved from 22 complexes to 39 complexes. Also, there is a dramatic drop in the number of complexes at the accuracy level of 0.2 which is attributed to the more accurate prediction of the small size clusters. We also observed that most of the improved complexes are small in size.
To display the result of protein complexes obtained, we developed an interface to show the complexes and their associated properties. The result on Figure 8 shows some of the protein complexes obtained from using our approach. It can be seen the proteins belonging to the same complex are colored the same and are strongly connected. Protein belonging to different complexes are colored differently and the edges between complexes (outer edges) are fewer compared to edges within the complex (inner edges). More statistics related to the network and

Protein-protein topological similarity matrix
Since the original PPI network is binary, incomplete in nature, and is not good for the clustering algorithms, we used the Random Walk with Resistance(RWS) [22] to reconstruct the network for better clustering performance.
Let G(V, E) be an undirected graph, V the set of vertices and E the set of edges. With the basic random walk idea, the probability to move from an initial node v to its neighbor node u in the next step is is the degree of node v. In Figure 7 Complex prediction accuracy distribution for the predicted complexes. RWS algorithm, we introduced two types of resistance ε and b and the probability for a random walker initiated from node v to take the edge (i, j) at time point k + 1 can be calculated by Equation (1), where (1) and the probability for the random walker to reach node j at time point k + 1 can be calculated as (2) The RWS algorithm is a modified Random Walk algorithm which uses two types of resistances. The first type of resistance(ε) makes the final convergence status different for each node based on its topology within the network. The second type of resistance(b) is used to control the depth of a random walk and helps to avoid the hub node effect. When applying the RWS algorithm to the original PPI network, we obtain all the topological profiles for each node. These topological profiles are used to calculate the correlation between all pairs of nodes, where correlation value represents the topological similarity between two nodes. The combined topological similarity for all pairs of nodes builds the PP-TS matrix, which can be viewed as a network with weighted edges.

Clustering procedure
In clustering the network, we first apply Qcut algorithm [19] which optimizes the modularity function (Q) of a given network [20]. However, because Qcut faces a resolution limit problem where communities smaller than a certain scale will never be identified, we further applied HQcut algorithm to the network to solve this problem. Even though HQcut works well, unlike the initial Qcut algorithm, a user has to input parameters to obtain the best Q value. Using HQcut, we devised Auto-HQcut which automatically picks the cutoff that gives the maximum Q value. A brief description of each algorithm is provided below.

Qcut algorithm
Qcut algorithm involves two steps: partitioning and refining. In partitioning, the network is recursively divided each time calculating the Q value until no further improvement is found in the Q value. In refining stage, the communities obtained in the partitioning stage are further refined to try to improve the Q value. In order to achieve this, the communities obtained are changed by: either moving the vertex from one community to another, combining two communities to form a new bigger community, or splitting a community to form two small communities. If the change in the original community results into a better Q value, a change is considered, otherwise no change is made to the original community.

HQcut algorithm
HQcut algorithm [19] improves Qcut algorithm by solving the resolution limit problem. It uses the Qcut algorithm to obtain the community structure which results in the highest Q value. To determine if the individual communities should further be partitioned, Qcut is applied again to the individual communities using threshold value that measures the strength of the community structure and the statistical significance of the original communities obtained. HQcut uses Q ≤ 0.3 threshold and p-value of 0.05 as a measure of community structure strength and its statistical significance respectively to decide whether the individual communities should further be partitioned and find whether partitioning would result in a better Q value.

Auto-HQcut
As seen from the algorithm description above, HQcut can further optimize the Q value hence increase prediction accuracy of protein complexes. However, it is hard to choose the best cutoff to apply on the PP-TS matrix that would produce the network with the best prediction accuracy of the protein complexes. As a result, we run HQcut to the networks generated from PP-TS matrix with different cutoffs, and measure the statistical significance of the resulting Q value by comparing the selected networks with the random generated networks. By using the modularity difference between these two values, we can tell if the selected network can predict the protein complexes from the random one. At the end, we pick the cutoff which gives the largest modularity difference value to be used to generate the protein complexes.

Biological functional relevance
To evaluate the biological relevance of the newly predicted edges, we resort to gene ontology and gene expression, which are widely used in the functional evaluation. To measure the biological functional relevance between any pair of genes, we used the semantic similarity between the GO terms annotated with the proteins, with a popular method [28,29], which incorporates both a global metric and a local metric for balance and consistency. Results shown in this paper are based on the "Molecular Function" branch of Gene Ontology. Using "Biological Process" yielded very similar values, and "Cellular Localization" resulted in slightly lower but consistent values. We also measured the Pearson correlation coefficient between the gene expression profiles of every pair of genes using the yeast stress response microarray data [30]. With the results from above two methods, we can use the GO/GE similarity between a pair of nodes to represent the biological relevance of those two proteins.

Complex prediction accuracy
To investigate the protein complexes prediction accuracy improvement, we compared the predicted complexes with the MIPS known protein complexes [31], which include 767 proteins in 170 known complexes after intersecting with the PPI network. To measure the accuracy of the prediction, we used the Fowlkes-Mallows index for comparing clustering [32,33]. Formally, let A be the list of gene pairs that fall into the same complex in the set of predicted complexes and B be the list of gene pairs in the set of known complexes, the prediction accuracy is measured by |A ∩ B|/ √ |A| × |B|, where |A| denotes the cardinality of the set A.

Visualization
To help visualize protein complexes obtained from our approach, we developed an interface also used as a plugin in Cytoscape [34]. The interface was developed in MATLAB and compiled into java classes using MATLAB java compiler. The interface displays the interaction of the obtained protein complexes and the properties of the protein complexes such as average connectivity, ratio between intra-cluster edges and intercluster edges, and the number of proteins within the complex. One unique feature of the interface is its ability to uniquely color the protein complexes so that different complexes within the network can easily be differentiated.

Conclusion
In this paper, we developed a random walk based algorithm that converts the PPI network into a protein-protein topological similarity matrix which is then used to construct weighted networks. The key idea is that two proteins sharing some high-order topological similarities are likely to be interacting with each other and be involved in the same biological processes. Using the reconstructed weighted network, we can measure the interactions of all the protein-protein pairs using a real value as opposed to the "connected/non-connected" measure in the original PPI network, which also helps to reduce noise and uncover significant biological interactions that would otherwise be overlooked when analyzing the original PPI network. We then used a parameter free modularity based community finding algorithm (Auto-HQcut) to identify protein complexes from PPI network by optimizing the modularity function. Finally, we used the interface we developed to visualize and analyze the characteristics of the protein complexes obtained. The results showed that the algorithm can find higher modularity protein complexes with better prediction accuracy.
In summary, the weighted network reconstructed from PP-TS matrix has much higher biological relevance than the original network and Auto-HQcut significantly improved protein complex prediction accuracy. Since our method improved the protein-protein similarity quality without any additional biological information involved, the algorithm can be easily combined with other approaches to improve the analysis of PPI networks and protein complex prediction.