Global network alignment
A PPI network can be represented by an undirected simple network $G = (V_G, E_G)$, where $V_G = \{v_1, \ldots, v_N\}$ is a finite set of N vertices representing the N proteins, and $E_G$ is the set of edges representing the pairs of interacting proteins. Consider two PPI networks G and H from different species, with $|V_G| = N$ and $|V_H| = M$; without loss of generality, we assume N < M. The GNA problem is to find a total injective function $f : V_G \to V_H$ that matches similar proteins and conserves as many interactions as possible between matched pairs in the two networks. Because f is injective, no two nodes from the smaller network G can be aligned to the same node in the larger network H. To quantify how topologically similar two networks are, we can use the edge correctness (EC) measure [15]:
$$\mathrm{EC} = \frac{|\{(u, v) \in E_G : (f(u), f(v)) \in E_H\}|}{|E_G|} \times 100\% \qquad (1)$$
EC is the percentage of edges of the smaller network that are correctly aligned to edges of the larger network. Naturally, when aligning two networks, we want to achieve as high an EC as possible. GNA is NP-complete, like the underlying subnetwork isomorphism problem, so heuristic approaches must be devised to obtain approximate solutions, especially for large PPI networks.
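The EC definition above can be illustrated with a minimal sketch. The function name `edge_correctness` and the toy networks are hypothetical and only serve the example; the sketch assumes simple undirected networks and a mapping `f` that is total and injective on the nodes of G.

```python
def edge_correctness(edges_g, edges_h, f):
    """Percentage of edges of G that the mapping f sends onto edges of H."""
    # Edges are undirected, so store each one as a frozenset: (u, v) == (v, u).
    g_edges = {frozenset(e) for e in edges_g}
    h_edges = {frozenset(e) for e in edges_h}
    conserved = sum(
        1 for e in g_edges
        if frozenset(f[u] for u in e) in h_edges
    )
    return 100.0 * conserved / len(g_edges)


# Toy example (hypothetical data): a triangle aligned onto a path.
edges_g = [(1, 2), (2, 3), (1, 3)]
edges_h = [("a", "b"), ("b", "c"), ("c", "d")]
f = {1: "a", 2: "b", 3: "c"}
ec = edge_correctness(edges_g, edges_h, f)
print(f"EC = {ec:.1f}%")
```

Here two of the three edges of G are conserved, since (1, 3) maps to (a, c), which is not an edge of H.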
On the other hand, maximizing the EC should not be the only goal when aligning biological networks; we must strike a balance between the topological and the biological significance of the result. MHA therefore improves the EC while maintaining the structure of functional modules. Although it is itself a seed-and-extend approach, MHA can, to a certain extent, solve the problem of choosing significant seeds that affects such heuristics.
The seed selection strategy and our algorithm
Figure 2 shows an example of applying our method to the GNA problem. First, we determine the centralities of the nodes in each network; Figure 2(a) shows the two networks with node sizes proportional to their centrality values. The multiple hub seed set {(1, a), (6, k)} is then constructed from the pairs with locally maximal similarity. Nodes 1, 6, a and k are clearly local hubs of their respective networks. Second, beginning with the seed (1, a), the partial alignment {(1, a), (2, b), (3, c), (4, e), (5, d), (6, g)} is obtained, as shown in Figure 2(b); likewise, beginning with the seed (6, k), another partial alignment {(6, k), (7, l), (8, m), (9, n)} is obtained, as shown in Figure 2(c). During this alignment process the membership similarity is dynamic: it changes with the seed currently being aligned. MHA thus produces a more consistent overall alignment, shown as Alignment2 in Figure 1(c): it aligns two dense regions in network1 (marked in red and purple) to distinct dense regions in network2 (marked in green and blue).
What follows is a detailed description of the major steps of the MHA method. To construct hub seeds for the seed-and-extend process, we compute the centralities and membership values of the nodes in each network. ModuLand [19], an integrative family of methods for network module identification and key node determination, can be used to obtain the values needed in steps 1-3 directly, as follows:
Step 1: Determination of centralities of nodes in networks. The NodeLand algorithm [19] iteratively determines the centrality value of each node $i \in V_G$ in the respective network $G = (V_G, E_G)$. We compute the centrality similarity (SC) measure between all pairs of nodes $(i, j) \in V_G \times V_H$ from networks G and H. An N × M matrix SC is then defined as:
(2)
(3)
In these definitions, the centrality values of nodes i and j are taken in their respective networks, and θ is the average centrality value of all nodes in networks G and H. There are always many pairs of nodes whose centrality values differ by less than an order of magnitude, and the logarithms of the centrality values should be employed in practice. Note that SC(i, j) is large if nodes i and j are both significant in their own networks (both centrality values are high) and the gap between their centrality values is small. The term θ ensures that a pair of similar nodes (i, j) with a small gap still receives an SC of a certain magnitude even if both nodes play very small topological roles.
Step 2: Determination of multiple hub seeds. Here we present one concise approach for determining the hub seeds of the networks. A local-maxima-based hub seed is defined as a pair of nodes with a locally maximal SC value, that is, a pair whose neighbouring nodes all have lower SC values. The result of this step is the set SEEDS containing all hub seeds.
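The local-maximum rule can be sketched as follows. The text leaves the exact notion of "neighbouring" pairs open, so this sketch makes one explicit assumption: a pair (i, j) is a seed when SC(i, j) is strictly greater than SC(u, v) for every neighbour u of i in G and every neighbour v of j in H. The function name `hub_seeds`, the adjacency-list inputs and the SC values are hypothetical.

```python
def hub_seeds(sc, adj_g, adj_h):
    """Return pairs (i, j) whose SC value is a strict local maximum.

    Assumption (not fixed by the text): (i, j) is compared against all
    pairs (u, v) with u adjacent to i in G and v adjacent to j in H.
    """
    seeds = []
    for i in adj_g:
        for j in adj_h:
            neighbour_pairs = [(u, v) for u in adj_g[i] for v in adj_h[j]]
            # Skip isolated nodes, which have no neighbours to dominate.
            if neighbour_pairs and all(
                sc[i][j] > sc[u][v] for u, v in neighbour_pairs
            ):
                seeds.append((i, j))
    return seeds


# Toy example (hypothetical data): two small stars with centres 1 and "a".
adj_g = {1: [2, 3], 2: [1], 3: [1]}
adj_h = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
sc = {
    1: {"a": 0.9, "b": 0.3, "c": 0.3},
    2: {"a": 0.3, "b": 0.2, "c": 0.2},
    3: {"a": 0.3, "b": 0.2, "c": 0.2},
}
seeds = hub_seeds(sc, adj_g, adj_h)
print(seeds)
```

In this toy input only the pair of star centres dominates all of its neighbouring pairs, so it is the only seed returned.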
Functional modules in PPI networks are each associated with a single central hub protein. Beginning with a hub seed $(h_g, h_h) \in$ SEEDS, the seed-and-extend process extends and aligns the neighbourhood of the seed in order to construct functional modules with similar biological function and topology around the two hubs. Because the same process is applied to every hub seed, MHA avoids the situation in which multiple functional modules in G are aligned to a few dense regions in H, and it can align the networks module by module.
Step 3: Determination of membership values. The membership value calculated by the ProportionalHill method [19] quantifies the extent to which a node can topologically participate in the functional module associated with the current seed. We write $m_{h_g}(i)$ for the membership value of node i relative to seed node $h_g$; a seed node gets the maximum membership value relative to itself. We construct the membership similarity (SM) between all pairs of neighbouring nodes (i, j) for a hub seed $(h_g, h_h)$:
$$SM(i, j) = \frac{2 \, m_{h_g}(i) \, m_{h_h}(j)}{m_{h_g}(i) + m_{h_h}(j)} \qquad (4)$$
SM(i, j) is the harmonic mean of $m_{h_g}(i)$ and $m_{h_h}(j)$. It is not necessary to calculate SM(i, j) between all pairs of nodes $(i, j) \in V_G \times V_H$ relative to all hub seeds in $H_G$ and $H_H$. For a given current seed $(h_g, h_h)$, MHA only needs to compute the SM values between the neighbouring nodes around $h_g$ and $h_h$, and it does so dynamically during the alignment process (step 5).
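As a small illustration of the harmonic mean used for SM, and of computing it only for the nodes around the current seed, consider the sketch below. The names `membership_similarity`, `memb_g` and `memb_h` and all values are hypothetical, and membership values are assumed to be non-negative.

```python
def membership_similarity(m_g, m_h):
    """Harmonic mean of two non-negative membership values."""
    if m_g + m_h == 0:
        return 0.0  # Both memberships are zero: define the similarity as 0.
    return 2.0 * m_g * m_h / (m_g + m_h)


# Hypothetical membership values relative to the current seed only.
memb_g = {2: 1.0, 3: 0.5}        # nodes of G around the seed hub in G
memb_h = {"b": 0.25, "c": 0.5}   # nodes of H around the seed hub in H

# SM is computed lazily, for the neighbourhood of the current seed only.
sm = {
    (i, j): membership_similarity(memb_g[i], memb_h[j])
    for i in memb_g
    for j in memb_h
}
print(sm[(3, "c")])
```

The harmonic mean is dominated by the smaller of its two arguments, so a pair scores highly only when both nodes belong strongly to the modules of their respective hubs.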
Step 4: Construction of the "similarity scores" matrix $S_{N \times M}$. We implement MHA by using the topological similarity (SC) between nodes in two networks, along with the sequence similarity (SE) given by the BLAST [20] E-value score between protein sequences. BLAST E-values are a standard measure for deciding whether two proteins are orthologs. Note that the "perfect" alignment should minimize centrality, membership and sequence differences between nodes. Hence, the similarity score between nodes i and j, S(i, j), is computed as follows:
(5)
S is an N × M matrix and S(i, j) is a dynamic similarity: the seed under consideration during the alignment process determines SM(i, j) and hence S(i, j). The weight α can be adjusted to assign relative importance to biological and topological data, depending upon the confidence level attributed to them and the type of results sought. In our implementation we set α = 0.6. This parameter is discussed in the following section.
We next present the detailed description of MHA based on the matrix S, and then define the specific concepts used.
Algorithm MHA (G, H)
Construct the matrices SC, SE and the set SEEDS.
Initialize alignment A to the empty set and the alignment score vector B to 0.
for each hub seed $(h_g, h_h) \in$ SEEDS do
Add $(h_g, h_h)$ to alignment A and set $B(h_g) = S(h_g, h_h)$.
for all k ∈ {1, ..., D} do
Construct a bipartite graph over the kth neighbourhoods of $h_g$ and $h_h$.
Compute SM(i, j) relative to the seed and assign the weight ω(i, j) = S(i, j).
Solve the Maximum Weight Bipartite Matching Problem by the Hungarian algorithm [16].
For each pair (u, v) in the optimal matching found above, if S(u, v) > B(u), then add (u, v) to the current alignment A (replacing any earlier match of u) and set B(u) = S(u, v).
end for
end for
return alignment A.
The kth neighbourhood of node $h_g$ in network G is defined as the set of nodes of G that are at distance ≤ k from $h_g$. Hence, the kth still-unaligned neighbourhood, which contains the nodes of this set that have not yet been aligned, can be thought of as the "ball" of nodes around $h_g$ up to and including nodes at distance k. This allows us to model insertions and deletions of nodes in the paths conserved between two networks. D is the maximum distance considered.
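Reading the kth neighbourhood as the "ball" of nodes up to and including distance k, the still-unaligned neighbourhood can be collected with a depth-limited breadth-first search. This is a minimal sketch; the function name `unaligned_neighbourhood`, the adjacency-list representation and the toy network are hypothetical.

```python
from collections import deque


def unaligned_neighbourhood(adj, hub, k, aligned):
    """Nodes within distance k of hub that are not yet aligned."""
    dist = {hub: 0}
    queue = deque([hub])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # Do not expand beyond the radius k.
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # The hub itself is normally already aligned and therefore excluded.
    return {v for v in dist if v not in aligned}


# Toy example (hypothetical data): a path 1 - 2 - 3 - 4 with hub 1.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
ball_1 = unaligned_neighbourhood(adj, hub=1, k=1, aligned={1})
ball_2 = unaligned_neighbourhood(adj, hub=1, k=2, aligned={1})
print(ball_1, ball_2)
```

Growing k from 1 to D enlarges the ball one layer at a time, which is what lets nodes at slightly different distances from the two hubs be matched to each other.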
A bipartite graph, $BP(V_1, V_2, E)$, is a graph with a node set V consisting of two partitions, $V = V_1 \cup V_2$, so that every edge e ∈ E connects a node from $V_1$ with a node from $V_2$; that is, there are no edges between nodes of $V_1$ and there are no edges between nodes of $V_2$: all the edges "go across" the node partition.
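To make the maximum weight bipartite matching step concrete, the sketch below finds an optimal matching on a tiny complete bipartite graph by exhaustive search. This is deliberately not the Hungarian algorithm, only a brute-force stand-in that returns a matching of the same maximum weight on small inputs; the function name `max_weight_matching` and the weights are hypothetical.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Exhaustive maximum weight matching on a complete bipartite graph.

    weight maps every pair (u, v), u in left and v in right, to a score.
    Only practical for very small node sets.
    """
    if len(left) > len(right):
        # Match from the smaller side so that every permutation is valid.
        flipped = {(v, u): w for (u, v), w in weight.items()}
        return [(u, v) for v, u in max_weight_matching(right, left, flipped)]
    best, best_score = [], float("-inf")
    for perm in permutations(right, len(left)):
        pairs = list(zip(left, perm))
        score = sum(weight[p] for p in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return best


# Toy example (hypothetical weights): both 2 and 3 prefer "b".
left = [2, 3]
right = ["b", "c", "d"]
weight = {
    (2, "b"): 0.9, (2, "c"): 0.4, (2, "d"): 0.1,
    (3, "b"): 0.8, (3, "c"): 0.5, (3, "d"): 0.2,
}
matching = max_weight_matching(left, right, weight)
print(matching)
```

Both nodes 2 and 3 score best against "b", so picking each node's best partner independently would collide; the matching resolves the conflict by maximizing the total weight.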
The set A contains the pairs of nodes produced as optimal matchings during the MHA process, and the vector B records the alignment score with which each match was added to A. Because multiple hub seeds are used, the node in H to which MHA has matched a given node in G can change. Membership values depend on the seed currently being aligned, so different seeds yield different similarity scores, and a node in G is aligned to the node with which it attains the highest similarity score. The significance of this dynamic membership value is that a node can select the best associated hub seed with which to form a functional module, and that several smaller modules in one network are never forcibly covered by a larger module in the other network as the alignment proceeds.
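The way A and B evolve across seeds can be sketched as below. The sketch assumes B is indexed by the nodes of G and starts at 0, and it stores A as a dictionary from nodes of G to nodes of H; the name `update_alignment` and all scores are hypothetical. It implements only the stated rule S(u, v) > B(u) and does not by itself check that v is still unmatched in H.

```python
def update_alignment(alignment, best_score, matching, score):
    """Keep, for every node u of G, its highest-scoring match seen so far."""
    for u, v in matching:
        s = score[(u, v)]
        # Unmatched nodes have an implicit score of 0.
        if s > best_score.get(u, 0.0):
            alignment[u] = v   # Replaces any earlier match of u.
            best_score[u] = s
    return alignment, best_score


# Toy example (hypothetical scores): node 2 is reached from several seeds.
A, B = {}, {}
update_alignment(A, B, [(2, "b")], {(2, "b"): 0.5})   # first seed
update_alignment(A, B, [(2, "m")], {(2, "m"): 0.7})   # better: re-aligned
update_alignment(A, B, [(2, "c")], {(2, "c"): 0.6})   # worse: ignored
print(A, B)
```

Node 2 ends up matched to the partner proposed under the seed that gave it the highest similarity score, which is the behaviour described for dynamic membership values.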
Computational complexity of MHA
The NodeLand [19] algorithm for determining node centralities is structurally similar to a breadth-first search; its worst-case runtime complexity is therefore $O(N \cdot (N + |E|))$, where N is the number of nodes and |E| is the number of edges in the network. The ProportionalHill [19] method has a runtime complexity of $O(d \cdot |E| \cdot h)$, with d being the average node degree and h the number of identified hubs. Solving the matching problem for a bipartite graph $BP(V_1, V_2, E)$ takes $O((|V_1| + |V_2|) \cdot (|E| + (|V_1| + |V_2|) \cdot \log(|V_1| + |V_2|)))$. Therefore, the total time complexity of the MHA algorithm for aligning networks $G = (V_G, E_G)$ and $H = (V_H, E_H)$ is bounded above by $O(h \cdot |V_G| \cdot (|E_G| + |V_G| \cdot \log |V_G|))$, and the space complexity is $O(|V_G| \cdot |V_H| + |E_G| + |E_H|)$.