### Weighted network

Weights quantify the likelihood of interaction between each pair of proteins, and they can be estimated by encoding the proteins with Gene Ontology (GO) annotations. An “ontology” is a specification of a conceptualization that refers to the subject of existence. GO is organized around three criteria: (I) biological process, the biological objective to which the gene or gene product contributes; (II) molecular function, the biochemical activity of a gene product; and (III) cellular component, the place in the cell where a gene product is active. It is very common for the same protein, or proteins in the same subfamily, to form protein complexes; for example, the proteins Ste2p and Ste3p form a complex that is among the activated G protein-coupled receptors in yeast cellular mating.[15] It is also common for proteins from different families to form protein complexes if they share a conserved motif; for example, the proteins Ctf19, Mcm21, and Okp1 form a heterocomplex in the budding yeast kinetochore.[16] More complicated protein complexes may be formed by multiple proteins, some of which share the same biological processes and some of which belong to the same subfamily; for example, the Dsl1p complex, which is involved in Golgi-ER retrograde transport, includes Dsl1p, Dsl3p, Q/t-SNARE proteins, and so forth.[17] GO is therefore considered a very helpful vehicle for investigating protein-protein interactions,[18] because these three criteria reflect the attributes of genes, gene products, and gene-product groups, as well as subcellular localization.[19–21]

Semantic similarity has been used in information science to evaluate the similarity between two concepts in a taxonomy [22], and we apply it to protein-protein interactions to estimate the similarity between two proteins. Our semantic similarity method builds on a previous method [23]. We define the annotation size of a GO term as the number of proteins annotated on that term. The semantic similarity between two proteins is then calculated from the annotation sizes of the GO terms on which both proteins are annotated. By the transitivity property of GO annotation, if a protein x is annotated on a GO term g_{i}, it is also annotated on every GO term on the path from g_{i} to the root GO term in the GO structure. Thus, the ratio of the annotation size of a GO term to the total number of annotated proteins quantifies the specificity of the GO term: the smaller the ratio, the more specific the term. If two proteins are annotated on a more specific GO term and have more GO terms in common, then they are functionally more similar.
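The transitivity property and the specificity ratio can be illustrated with a minimal sketch. The toy DAG, the protein names, and the helper names `propagate_annotations` and `specificity` are all hypothetical and chosen only for illustration; real GO data would be loaded from annotation files.

```python
from collections import defaultdict

def propagate_annotations(direct, parents):
    """Map each GO term to the set of proteins annotated on it,
    after copying every direct annotation to all ancestor terms."""
    annotated = defaultdict(set)
    for protein, terms in direct.items():
        stack, seen = list(terms), set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            annotated[term].add(protein)
            stack.extend(parents.get(term, ()))
    return dict(annotated)

def specificity(annotated):
    """Annotation size of each term divided by the total number of
    annotated proteins (a smaller ratio means a more specific term)."""
    total = len(set().union(*annotated.values()))
    return {term: len(ps) / total for term, ps in annotated.items()}

# Hypothetical toy DAG: b -> a -> root, c -> root
parents = {"a": ["root"], "b": ["a"], "c": ["root"], "root": []}
# Hypothetical direct annotations
direct = {"w": ["a"], "x": ["b"], "y": ["b", "c"], "z": ["c"]}

annotated = propagate_annotations(direct, parents)
spec = specificity(annotated)
```

In this toy example the root term is annotated with all four proteins and so has the largest ratio, while the leaf terms `b` and `c` have the smallest.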

Suppose a protein x is annotated on m different GO terms. S_{i}(x) denotes the set of proteins annotated on the GO term g_{i}, whose annotation includes x, where 1≤i≤m. In the same way, suppose both x and y are annotated on n different GO terms, where n≤m. S_{j}(x, y) denotes the set of proteins annotated on the GO term g_{j}, whose annotation includes both x and y, where 1≤j≤n. Because every GO term shared by x and y is also a GO term of x, the minimum size of S_{i}(x), min_{i}|S_{i}(x)|, is less than or equal to min_{j}|S_{j}(x, y)|. C(x, y) denotes the set of GO terms whose annotation includes both x and y, and |C(x, y)| is the number of GO terms that x and y have in common.
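These sets can be made concrete with a small sketch. The term-to-protein mapping below is hypothetical toy data (assumed to be already closed under the transitivity property), and the helper names are illustrative only.

```python
def terms_of(protein, annotated):
    """GO terms whose annotation includes the given protein."""
    return {t for t, ps in annotated.items() if protein in ps}

def common_terms(x, y, annotated):
    """C(x, y): GO terms whose annotation includes both x and y."""
    return terms_of(x, annotated) & terms_of(y, annotated)

def min_annotation_size(terms, annotated):
    """Smallest annotation size among the given GO terms."""
    return min(len(annotated[t]) for t in terms)

# Hypothetical toy data: GO term -> set of annotated proteins
annotated = {
    "root": {"w", "x", "y", "z"},
    "a": {"w", "x", "y"},
    "b": {"x", "y"},
    "c": {"y", "z"},
    "d": {"x"},
}

m_terms = terms_of("x", annotated)           # the m terms of x
c_xy = common_terms("x", "y", annotated)     # the n common terms
min_x = min_annotation_size(m_terms, annotated)
min_xy = min_annotation_size(c_xy, annotated)
```

Here x has m = 4 terms, x and y share n = 3 terms, and the most specific term of x alone (size 1) is no larger than the most specific shared term (size 2), as the inequality requires.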

Suppose the size of annotation represents the number of annotated proteins on a GO term. Using the annotation size of the most specific GO term on which two proteins x and y are both annotated, we define the semantic similarity S_{sem}(x, y) between x and y as follows:

S_{max} is the maximum annotation size among all GO terms in the GO directed acyclic graph (DAG). If two proteins x and y are annotated on a more specific GO term and share more GO terms than x and z do, then x is semantically more similar to y than to z.

To account for the graph topology, we also include a topology weight. For an input graph G = (V, E), the topology weight of an edge [u, v] is the number of neighbors shared by the vertices u and v. The weight assigned to the edge between u and v is then the sum of S_{sem}(u, v) and the topology weight.
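The edge-weighting step can be sketched as follows. The semantic similarity is passed in as a caller-supplied function `s_sem`, which is a placeholder here (a constant in the example), and the function names and toy graph are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def topology_weight(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def weight_edges(edges, s_sem):
    """Edge weight = semantic similarity + number of shared neighbors."""
    adj = build_adjacency(edges)
    return {(u, v): s_sem(u, v) + topology_weight(adj, u, v)
            for u, v in edges}

# Hypothetical toy network: a triangle a-b-c with a pendant vertex d
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
# Placeholder similarity; a real S_sem would come from GO annotations
weights = weight_edges(edges, s_sem=lambda u, v: 0.5)
```

Each triangle edge gains one shared neighbor on top of the placeholder similarity, while the pendant edge (c, d) keeps only its semantic term.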

### Extending cluster

We introduce a new concept to measure how strongly a vertex v is connected to a subgraph K: the interaction probability E_{vk} of a vertex v to a subgraph K, where v∉K, is defined by

Here, e_{vk} is the sum of the weights of the edges between the vertex v and K, and w_{k} is the sum of the weights of the edges in K. We now discuss the relationship between the parameter E_{vk} and the parameter IN_{vK} introduced in the algorithm IPCA [14]. According to [14], IN_{vK} is defined as IN_{vK} = m_{vK} / n_{K}, where m_{vK} is the number of edges between the vertex v and K, and n_{K} is the number of vertices in K. The two parameters have a similar form, but because ours takes the biological weights into account, it carries more biological meaning.
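The quantities named above can be computed directly from a weighted edge list. The sketch below returns e_{vk} and w_{k} alongside the unweighted counts m_{vK} and n_{K} and the IPCA ratio IN_{vK}; the function name, the dictionary layout, and the toy weights are assumptions made for illustration.

```python
def connection_stats(weights, K, v):
    """Weighted and unweighted measures of how v attaches to K.

    weights: {(u, v): weight} for an undirected graph.
    """
    w = {frozenset(e): wt for e, wt in weights.items()}
    K = set(K)
    # Edges that join v to a member of K
    links = [frozenset((v, u)) for u in K if frozenset((v, u)) in w]
    e_vk = sum(w[e] for e in links)                      # weight between v and K
    w_k = sum(wt for e, wt in w.items() if e <= K)       # weight inside K
    m_vk, n_k = len(links), len(K)
    return {"e_vk": e_vk, "w_k": w_k,
            "m_vK": m_vk, "n_K": n_k, "IN_vK": m_vk / n_k}

# Hypothetical weighted network
weights = {("a", "b"): 1.5, ("a", "c"): 1.5, ("b", "c"): 1.5,
           ("c", "d"): 0.5, ("b", "d"): 2.0}
stats = connection_stats(weights, K={"a", "b", "c"}, v="d")
```

In this example the two links from d to K count equally toward IN_{vK}, whereas e_{vk} distinguishes the strong link (2.0) from the weak one (0.5).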

A cluster K is extended by recursively adding vertices from its neighbors in order of priority. The priority of a neighbor v of K is determined by the value E_{vk}. This procedure is similar to the one proposed in IPCA [14], except that we do not use IN_{vK} to decide whether to extend. Instead, whether a high-priority vertex v is added to the cluster is determined by the Extend-judgment test below.

Let T_{in} be a threshold ranging between 0 and 1, let d be a positive integer, and let K be a subgraph. SP is the shortest path. A vertex v∉K is added to the cluster if the following two conditions are satisfied (where K + v denotes the subgraph induced by K and v):

- 1. E_{vk} ≥ T_{in};

- 2. SP(K + v) ≤ d.

A candidate vertex v can be added to the cluster only if it satisfies both conditions. Once the new vertex v is added to the cluster, the cluster is updated.
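One possible reading of the extension procedure is sketched below, under several assumptions. The priority is supplied as a caller-defined function `score(v, K)` standing in for E_{vk}. The first condition is taken to be `score(v, K) >= t_in`. The second is taken to bound the longest shortest path (in hops) of the subgraph induced by K + v by d. The stand-in score used in the example is purely illustrative and is not the weighted parameter defined in the text.

```python
from collections import deque

def diameter(adj, nodes):
    """Longest shortest path (in hops) inside the subgraph induced by
    nodes; infinity if that subgraph is disconnected."""
    nodes = set(nodes)
    best = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u] & nodes:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < len(nodes):
            return float("inf")
        best = max(best, max(dist.values()))
    return best

def extend_cluster(adj, seed, score, t_in, d):
    """Grow a cluster by adding the highest-priority neighbor that
    passes both (assumed) conditions, until no neighbor qualifies."""
    K = set(seed)
    while True:
        neighbors = set().union(*(adj[u] for u in K)) - K
        # Highest score first; ties broken by name for determinism
        ranked = sorted(neighbors, key=lambda v: (-score(v, K), v))
        for v in ranked:
            if score(v, K) >= t_in and diameter(adj, K | {v}) <= d:
                K.add(v)      # the cluster is updated, priorities recomputed
                break
        else:
            return K          # no candidate passed the test

# Hypothetical toy network
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
       "d": {"b", "c", "e"}, "e": {"d"}}
# Illustrative stand-in score: fraction of K adjacent to v
score = lambda v, K: len(adj[v] & K) / len(K)
cluster = extend_cluster(adj, {"a", "b", "c"}, score, t_in=0.5, d=2)
```

In this example d is admitted because it touches two of the three seed vertices and keeps the diameter at 2, while e is rejected because it touches only one of the four cluster vertices.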