- Research
- Open Access

# Ranking and compacting binding segments of protein families using aligned pattern clusters

- En-Shiun Annie Lee^{1} (corresponding author)
- Andrew KC Wong^{1}

**11 (Suppl 1)**:S8

https://doi.org/10.1186/1477-5956-11-S1-S8

© Lee and Wong; licensee BioMed Central Ltd. 2013

**Published:** 7 November 2013

## Abstract

### Background

Discovering sequence patterns with variation can unveil functions of a protein family that are important for drug discovery. Exploring protein families using existing methods such as multiple sequence alignment is computationally expensive, thus pattern search, called motif finding in Bioinformatics, is used. However, at present, combinatorial algorithms result in large sets of solutions, and probabilistic models require a richer representation of the amino acid associations. To overcome these shortcomings, we present a method for ranking and compacting these solutions in a new representation referred to as Aligned Pattern Clusters (APCs). To tackle the problem of a large solution set, our method reveals a reduced set of candidate solutions without losing any information. To address the problem of representation, our method captures the amino acid associations and conservations of the aligned patterns. Our algorithm renders a set of APCs in which a set of patterns is discovered, pruned, aligned, and synthesized from the input sequences of a protein family.

### Results

Our algorithm identifies the binding or other functional segments, together with their embedded residues, which are important drug targets, from the cytochrome c and the ubiquitin protein families taken from UniProt. The results are independently confirmed by pFam's multiple sequence alignment. For the cytochrome c protein family, the number of resulting patterns with variations is reduced by 76.62% from the number of original patterns without variations. Furthermore, all of the top four candidate APCs correspond to the binding segments, each with one of its conserved amino acids as the binding residue. The discovered proximal APCs agree with pFam and PROSITE results. Surprisingly, the distal binding site discovered by our algorithm is not discovered by pFam or PROSITE, but is confirmed by the three-dimensional cytochrome c structure. When applied to the ubiquitin protein family, our results agree with pFam and reveal six of the seven lysine binding residues as conserved aligned columns with an entropy redundancy measure of 1.0.

### Conclusion

The discovery, ranking, reduction, and representation of a set of patterns is important to avert time-consuming and expensive simulations and experimentations during proteomic study and drug discovery.

## Keywords

- Protein Analysis
- Protein Function Identification
- Pattern Discovery
- Pattern Clustering
- Hierarchical Clustering
- Pattern Search
- Motif Finding
- Local Alignment
- Drug Discovery

## Introduction

Figure 1: An intuitive example from the cytochrome c protein showing parts of the protein sequence that represent the binding sites.

In Bioinformatics, two common approaches for identifying the protein family's function are multiple sequence alignment and motif finding. Multiple sequence alignment aligns a set of protein sequences from the same protein family in order to identify important regions and sites in the resulting alignment. Common multiple sequence alignment tools include Clustal Omega [9], T-Coffee [10], DIALIGN [11] and HMMER [12]. However, finding the globally optimal alignment is computationally expensive, and is known in computational complexity as an NP-complete problem [13]. Even with approximate heuristics added, multiple sequence alignment is not efficient in handling large datasets. Moreover, this approach is only appropriate for highly similar sequences, but not for sequences with considerable dissimilarity. Therefore, for dissimilar sequences, it is more suitable to identify similarities locally than to align the entire sequences globally. Thus, the suspected consensus regions have to be located and preprocessed ahead of alignment.

Another approach for identifying the protein family's function by similar local subsequences [14] is called motif finding, which builds motifs into combinatorial models and probabilistic models. The combinatorial model identifies commonly repeated sequence patterns exhaustively [15–17]. Work reported in Pevzner et al. [18] and Mandoiu et al. [19] created cliques where vertices are sequence patterns, edges connect similar sequence patterns, and complete graphs represent the best consensus patterns. However, these combinatorial methods are computationally intensive [20, 21] and produce too many possible candidates. The probabilistic model commonly uses the position weight matrix (PWM), which estimates an amino acid at each position while assuming that each position is independent [22, 23]. An alternative random sequence synthesis takes further frame-shifted position into consideration by optimally aligning amino acids to create a probabilistic sequence [24, 25]. Other probabilistic methods make use of the Markov model, where the current state depends on a specified set of the past states. One such example is the popular pFam database [26], which builds a profile Hidden Markov Model (HMM) from the multiple sequence alignment of a protein family for classifying proteins and predicting their functionality. In general, the probabilistic models compress the data into probability distributions and express amino acid associations as an ordered set of random variables.

To overcome these limitations, we approached the problem from a data mining perspective where we first considered the occurrences and strength of the sequence patterns. We began by identifying a set of statistically strong sequence patterns and developed an Aligned Pattern (AP) Synthesis Process to align and cluster similar patterns into a reduced set of Aligned Pattern Clusters (APCs) for representing the similar sequence patterns that might be associated with binding segments. These APCs capture both the statistically significant sequence association of amino acids and their conservation in each of the aligned columns. More precisely, our APC Process aligns and groups similar sequence patterns with variations to form clusters of Aligned Patterns called APCs. We then examined whether or not the APCs correspond to the binding segments and binding residues that reflect the protein's functionality. This paper extends Lee et al. [27] by expanding upon the reduction and ranking of the results. The three ranking measures presented are coverage, quality, and standard residual.

When our APC Process was applied to the cytochrome c and ubiquitin protein families, we discovered a reduced set of APC solutions, which corresponds to the functional binding segments and binding residues of both families. Our APC Process obtained a smaller set of solutions than the combinatorial methods, and the APC renders a more compact yet knowledge-rich representation than the probabilistic methods. Having a smaller set of richer representations is crucial in identifying the drug targets for drug discovery.

## Methodology

Figure 2: A text example using the English alphabet illustrates the problem of sequence patterns with variations. It was created to demonstrate each step of the process succinctly. This text example will be repeated throughout the paper. The overall APC Process contains two steps: the PD Step and the APC Step. The final result is a list of APCs ordered by their ranking.

### The input sequences

To begin, the input sequence is built from the alphabet Σ, which contains a set of characters $\left\{{\sigma}_{1},{\sigma}_{2},\dots ,{\sigma}_{\left|\Sigma\right|-1},{\sigma}_{\left|\Sigma\right|}\right\}$. As an example, the English alphabet contains 26 characters, Σ = {'a', 'b', ..., 'y', 'z'}; mathematically, $\sigma_{1}$ = 'a', $\sigma_{2}$ = 'b', ..., $\sigma_{25}$ = 'y', $\sigma_{26}$ = 'z', and |Σ| = 26.

**A single sequence** Let $s^{k}$ be a sequence indexed by *k* composed of consecutive elements taken from the alphabet Σ: ${s}^{k}={s}_{1}^{k}{s}_{2}^{k}\dots {s}_{\left|{s}^{k}\right|-1}^{k}{s}_{\left|{s}^{k}\right|}^{k}$, where each ${s}_{i}^{k}\in \Sigma$ and $s^{k}$ is of length $\left|s^{k}\right|$. For example, aaaaaaaaaaaaHELLOaaaaaaaaaaaa is a sequence of length 29. This sequence can be represented by $s^{1}$, where $\left|s^{1}\right| = 29$, and the character at position 13 is ${s}_{13}^{1}=\text{H}$.

**A set of sequences** Let $\mathbb{S}=\left\{{s}^{k}|k=1,\dots ,|\mathbb{S}|\right\}=\left\{{s}^{1},{s}^{2},\dots ,{s}^{\left|\mathbb{S}\right|-1},{s}^{\left|\mathbb{S}\right|}\right\}$ be the set of input sequences, also called the data space, where $\left|\mathbb{S}\right|$ is the total number of input sequences and the sequences have lengths $\left|{s}^{1}\right|,\left|{s}^{2}\right|,\dots ,\left|{s}^{\left|\mathbb{S}\right|-1}\right|,\left|{s}^{\left|\mathbb{S}\right|}\right|$ respectively. Let each sequence, say sequence *k*, be ${s}^{k}={s}_{1}^{k}\dots {s}_{j}^{k}\dots {s}_{\left|{s}^{k}\right|}^{k}$, where ${s}_{j}^{k}\in \Sigma$ is the element at position *j* of sequence *k*. Together, these sequences form the data space $\mathbb{S}$.

### The pattern discovery step

The PD Step is a previously developed pattern discovery and pruning algorithm [28] that obtains a condensed list of significant patterns from the family of protein sequences.

**The pattern** In this paper, we consider a pattern to be a statistically significant and non-redundant pattern as defined in Wong et al. [28].

**Definition 1** *A pattern* ${\overline{p}}^{i}={s}_{{j}_{i}}^{i}{s}_{{j}_{i}+1}^{i}\dots {s}_{{j}_{i}+\left|{\overline{p}}^{i}\right|-1}^{i}$ *is a short sequence over* Σ *where* $\left|{\overline{p}}^{i}\right|$ *is its order (or length). The sequence association is statistically significant and non-redundant in the sense that it is delta-closed (i.e. it is not covered by a statistically significant super-pattern) and non-induced (i.e. its statistical significance is not induced by its statistically strong sub-patterns). An* UNALIGNED PATTERN ${\overline{p}}^{i}$ *is discovered by passing four statistical conditions defined in Wong et al.* [28].

An occurrence of the pattern ${\overline{p}}^{i}$ is expressed as $occ\left({\overline{p}}^{i}\right)={j}_{i}$ such that ${\overline{p}}^{i}={s}_{{j}_{i}}^{i}{s}_{{j}_{i}+1}^{i}\dots {s}_{{j}_{i}+\left|{\overline{p}}^{i}\right|-1}^{i}$, where *i* is the index of the sequence in which that pattern occurs, and $j_{i}$ is the starting index in that sequence $s^{i}$ where the pattern begins.

Example of patterns ${\overline{p}}^{1}$ =HELLO and ${\overline{p}}^{2}$ =MELLOW.

$\mathbb{S}$ | The Input Sequences |
---|---|
$s^{1}$ | bdxejrtewkwkHELLOkcmstsjavtpi |
$s^{2}$ | nfixtHELLOuzdovcaaxnkjfjcvwk |
$s^{3}$ | dimtndvkjmkHELLObkcmstsj |
$s^{4}$ | tzhgarzofdHELLOpwkxmc |
$s^{5}$ | tyjxjqnyHELLOwmopemlqfgptnwnq |
$s^{6}$ | kntywtoaxMELLOWbtiasycma |
$s^{7}$ | jilxchitivMELLOWriiiweyfzgvuyaa |
$s^{8}$ | hmlzvMELLOWorgfeb |
$s^{9}$ | xhmlzvqgcanyMELLOWgbfj |
$s^{10}$ | vqgcanyffcMELLOWvcnsnjvalbdvr |

**Data induced by the unaligned pattern** Let $\mathbb{D}\left({\overline{p}}^{i}\right)$ be the set of unique occurrences of the pattern ${\overline{p}}^{i}$ found in the input sequences. We call $\mathbb{D}\left({\overline{p}}^{i}\right)$ the data induced by ${\overline{p}}^{i}$ or the induced data of ${\overline{p}}^{i}$. We will return to the concept of the data induced by a pattern when we use it to compute the measures for aligned columns within the context of the APC.
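The induced data of a pattern can be illustrated on the text example. The following is a minimal sketch, assuming that each occurrence is recorded as a (sequence index, starting position) pair with 1-based indices and that patterns are matched as exact substrings; the function name `induced_data` and this tuple representation are illustrative choices, not part of the PD Step itself.

```python
from typing import List, Set, Tuple

# The ten input sequences of the text example.
SEQUENCES: List[str] = [
    "bdxejrtewkwkHELLOkcmstsjavtpi",
    "nfixtHELLOuzdovcaaxnkjfjcvwk",
    "dimtndvkjmkHELLObkcmstsj",
    "tzhgarzofdHELLOpwkxmc",
    "tyjxjqnyHELLOwmopemlqfgptnwnq",
    "kntywtoaxMELLOWbtiasycma",
    "jilxchitivMELLOWriiiweyfzgvuyaa",
    "hmlzvMELLOWorgfeb",
    "xhmlzvqgcanyMELLOWgbfj",
    "vqgcanyffcMELLOWvcnsnjvalbdvr",
]


def induced_data(pattern: str, sequences: List[str]) -> Set[Tuple[int, int]]:
    """Collect every occurrence of `pattern` as (sequence index, start), both 1-based."""
    occurrences: Set[Tuple[int, int]] = set()
    for k, seq in enumerate(sequences, start=1):
        start = seq.find(pattern)
        while start != -1:
            occurrences.add((k, start + 1))
            # Continue searching one position further to catch overlapping hits.
            start = seq.find(pattern, start + 1)
    return occurrences


hello_data = induced_data("HELLO", SEQUENCES)
mellow_data = induced_data("MELLOW", SEQUENCES)
print(sorted(hello_data))
print(sorted(mellow_data))
```

In this sketch the two patterns induce disjoint subsets of the data space, which is why neither pattern alone represents all ten sequences.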

### The aligned pattern clustering step

For the APC Step, we developed an algorithm that gathers a set of similar patterns of different lengths obtained from the PD Step and aligns them into patterns of the same length by inserting gaps and wildcards. Constrained by the statistical sequence association, the corresponding elements in this cluster of patterns are lined up into columns, thus reflecting their conservation and variation [27].

In one iterative step of hierarchical clustering, the algorithm merges APC $C_{1}$ and APC $C_{2}$, thereby creating the new APC $C_{3}$.

Figure 3: In one iterative step of hierarchical clustering, an existing APC, $C_{1}$ with *m* = 3 and *n* = 6, is merged with another APC, $C_{2}$ with *m* = 3 and *n* = 5, to result in the new APC, $C_{3}$, which is extended to *m* = 6 and *n* = 6.

**Definition 2** *A set of APCs is* $\mathbb{C}=\left\{{C}^{l}|l=1,\dots ,|\mathbb{C}|\right\}=\left\{{C}^{1},{C}^{2},\dots ,{C}^{\left|\mathbb{C}\right|-1},{C}^{\left|\mathbb{C}\right|}\right\}$. *An APC* $C^{l}$ *is a set of similar horizontal sequence patterns that have been optimally grouped and vertically aligned into a set of patterns* ${\mathbb{P}}^{l}=\left\{{p}^{1},{p}^{2},\dots ,{p}^{m}\right\}$ *represented by* $C^{l}$, *which is expressed as*

$$C^{l}=\begin{pmatrix}p^{1}\\ p^{2}\\ \vdots \\ p^{m}\end{pmatrix}=\begin{pmatrix}s_{1}^{1}&s_{2}^{1}&\dots &s_{n}^{1}\\ s_{1}^{2}&s_{2}^{2}&\dots &s_{n}^{2}\\ \vdots &\vdots &\ddots &\vdots \\ s_{1}^{m}&s_{2}^{m}&\dots &s_{n}^{m}\end{pmatrix}_{m\times n}$$

*where* ${s}_{j}^{i}\in \Sigma \cup \left\{-\right\}\cup \left\{*\right\}$ *is an amino acid in the pattern* $p^{i}$ *in an aligned column j. Each pattern of* $C^{l}$ *is aligned into length* $\left|C^{l}\right|=n$, *and there is a set of* $\left|{\mathbb{P}}^{l}\right|=m$ *patterns (rows) in* $C^{l}$.

In the text example in Figure 3, $C_{1}$, with *m* = 3 and *n* = 6, is merged with another APC, $C_{2}$, with *m* = 3 and *n* = 5, to result in the new APC, $C_{3}$, which is extended to *m* = 6 and *n* = 6.

**Definition 3** *An Aligned Pattern, which will simply be referred to as a pattern from this point forward, is a sequence of order-preserving amino acids that maximizes the similarity against a set of patterns from an APC,* ${\mathbb{P}}^{l}$ *of size* $\left|{\mathbb{P}}^{l}\right|=m$, *with gaps, wildcards, and mismatches. Let* ${p}^{i}\in {\mathbb{P}}^{l}$ *be* ${s}_{1}^{i}{s}_{2}^{i}\dots {s}_{\left|{p}^{i}\right|}^{i}$, *where* ${s}_{j}^{i}\in \Sigma \cup \left\{-\right\}\cup \left\{*\right\}$ *is an amino acid in the pattern* $p^{i}$ *and in the aligned column* $c_{j}$.

**Definition 4** *An* ALIGNED COLUMN $c_{j}$ *in* $C^{l}$ *represents the* $j^{th}$ *column of amino acids that have been aligned from the set of patterns contained in the current APC,* $C^{l}=\left(c_{1}\ c_{2}\ \dots \ c_{n}\right)$. *A conserved* ALIGNED COLUMN *is conserved to only one type of amino acid such that* $c_{j}={\left[\sigma \dots \sigma \dots \sigma \right]}^{T}$ *where* $\sigma \in \Sigma$.

In the text example, the third pattern is $p^{3}$ = HELLO* and the aligned column for the first position is $c_{1}={\left[B\ M\ H\ B\ B\ H\right]}^{T}$.

Example of an APC for the text example, with the six aligned patterns $p^{1},\dots ,p^{6}$ as rows (6 × 1) and the six aligned columns $c_{1},\dots ,c_{6}$ as columns (1 × 6), giving a 6 × 6 APC.

Pattern | $c_{1}$ | $c_{2}$ | $c_{3}$ | $c_{4}$ | $c_{5}$ | $c_{6}$ |
---|---|---|---|---|---|---|
$p^{1}$ | B | E | L | L | O | W |
$p^{2}$ | M | E | L | L | O | W |
$p^{3}$ | H | E | L | L | O | * |
$p^{4}$ | B | A | L | L | S | * |
$p^{5}$ | B | A | L | K | S | * |
$p^{6}$ | H | A | L | - | S | * |
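The relationship between aligned patterns (rows) and aligned columns can be sketched in a few lines of Python. This is an illustrative representation only, assuming an APC is stored as a list of equal-length strings in which `*` is a wildcard and `-` is a gap; the helper names `aligned_columns` and `is_conserved` are hypothetical.

```python
from typing import List

# Aligned patterns of the text example; '*' is a wildcard and '-' is a gap.
APC: List[str] = ["BELLOW", "MELLOW", "HELLO*", "BALLS*", "BALKS*", "HAL-S*"]


def aligned_columns(apc: List[str]) -> List[str]:
    """Transpose the m x n APC so that each returned string is one aligned column."""
    n = len(apc[0])
    if any(len(p) != n for p in apc):
        raise ValueError("all aligned patterns must have the same length")
    return ["".join(col) for col in zip(*apc)]


def is_conserved(column: str) -> bool:
    """A column is conserved when it holds a single symbol that is not a gap or wildcard."""
    return len(set(column)) == 1 and column[0] not in "-*"


columns = aligned_columns(APC)
conserved_positions = [j + 1 for j, c in enumerate(columns) if is_conserved(c)]
print(columns[0])            # first aligned column
print(conserved_positions)   # 1-based indices of conserved columns
```

For this example the first column reads `BMHBBH`, and only the third column (all `L`) is conserved.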

**Data induced by the APC** Let $\mathbb{D}\left({C}^{l}\right)$ be the data induced by the APC $C^{l}$, which is the subset of segments from the input sequences, or the data subspace, containing all the patterns from the APC $C^{l}$, where its corresponding ${\mathbb{P}}^{l}={\left\{{p}^{1},{p}^{2},\dots ,{p}^{m}\right\}}^{T}$. We call $\mathbb{D}\left({C}^{l}\right)$ the data induced by $C^{l}$ or the induced data of $C^{l}$. $\mathbb{D}\left({C}^{l}\right)$ is then the union of the segments from the input sequences induced by all the patterns contained in $C^{l}$: $\mathbb{D}\left({C}^{l}\right)=\mathbb{D}\left({p}^{1}\right)\cup \mathbb{D}\left({p}^{2}\right)\cup \cdots \cup \mathbb{D}\left({p}^{m}\right)=\bigcup _{\forall {p}^{i}\in {\mathbb{P}}^{l}}\mathbb{D}\left({p}^{i}\right)$.

### Measuring and ranking results

#### The three measures of APCs

In order to rank the set of constructed APCs, $\mathbb{C}$, three measures are computed for each APC, $C^{l}$. The three measures are Coverage, APC Quality, and Standard Residual.

**Coverage** The coverage of an APC accounts for the total number of input sequences that are covered by the APC, $C^{l}$, over the entire set of input sequences. Note that this also counts the number of occurrences in the induced data space $\mathbb{D}\left({C}^{l}\right)$.
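As a rough sketch of the coverage measure, the snippet below counts the input sequences that contain at least one pattern of an APC. It assumes that coverage is reported as a count of covered sequences and that patterns are matched as exact substrings, so wildcard and gap handling is left out; the function name `coverage` and the fourth sequence are hypothetical and only serve the illustration.

```python
from typing import Iterable, List


def coverage(patterns: Iterable[str], sequences: List[str]) -> int:
    """Count the input sequences that contain at least one of the patterns."""
    pattern_list = list(patterns)
    return sum(1 for seq in sequences if any(p in seq for p in pattern_list))


example_sequences = [
    "nfixtHELLOuzdovcaaxnkjfjcvwk",   # from the text example, contains HELLO
    "tzhgarzofdHELLOpwkxmc",          # from the text example, contains HELLO
    "hmlzvMELLOWorgfeb",              # from the text example, contains MELLOW
    "qqqqqqqqqqqq",                   # hypothetical sequence without any pattern
]

apc_coverage = coverage(["HELLO", "MELLOW"], example_sequences)
print(apc_coverage)
```

A sequence that contains several patterns of the same APC is counted once in this sketch.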

**APC Quality** The APC Quality, *Q*, is the average column entropy subtracted from one, where entropy is computed from the set of Aligned Patterns, ${\mathbb{P}}^{l}$, of $C^{l}$. The APC Quality measures the stability or reliability of an APC, whereas the entropy measures the randomness or the degree of variation within an APC. The value of *Q* approaches one when the resulting APC is more stable, and approaches zero when the resulting APC is more random. *Q* is expressed as:

$$Q=1-\frac{1}{n}\sum _{j=1}^{n}H\left({c}_{j}\right)$$

where $c_{j}$ is the aligned column in the resulting APC and *n* is the number of aligned columns. The entropy of an aligned column is

$$H\left({c}_{j}\right)=-\sum _{\sigma }Pr\left({c}_{j}=\sigma \right)\log Pr\left({c}_{j}=\sigma \right)$$

where *σ* ∈ Σ ∪ {−} ∪ {*} is the amino acid ${s}_{j}^{i}$ of $p^{i}$ at $c_{j}$, and the probability $Pr\left({c}_{j}=\sigma \right)$ is computed from counting the subset of patterns in ${\mathbb{P}}^{l}$.
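The APC Quality can be sketched as one minus the average column entropy. The code below is a minimal illustration under stated assumptions: the logarithm is taken to base 2 (the base is not stated in the text), gaps and wildcards are counted as ordinary symbols, and probabilities are relative frequencies over the patterns of the APC. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (base 2, an assumption) of the symbols in one aligned column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def apc_quality(apc: List[str]) -> float:
    """One minus the average column entropy over all aligned columns."""
    columns = ["".join(col) for col in zip(*apc)]
    return 1.0 - sum(column_entropy(c) for c in columns) / len(columns)


text_example = ["BELLOW", "MELLOW", "HELLO*", "BALLS*", "BALKS*", "HAL-S*"]
q_text = apc_quality(text_example)

# Hypothetical two-pattern APC that differs in a single column out of six.
q_pair = apc_quality(["ABCDEF", "ABCDEG"])
print(round(q_text, 3), round(q_pair, 3))
```

With base 2, a column split evenly between two symbols contributes an entropy of one, so the hypothetical pair above loses exactly one sixth of its quality.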

**Standard Residual** The Standard Residual measures the statistical significance of the APC by comparing the actual number of occurrences, *o*, of all the patterns included in the APC, against the expected number of occurrences, *e*, which is computed from the default random model of the APC. It is written as

$$SR=\frac{o-e}{\sqrt{e}}$$

where *o* is the actual number of occurrences of the patterns in ${\mathbb{P}}^{l}$ counted from the input data, $\mathbb{D}\left({C}^{l}\right)$, and *e* is the expected number of occurrences computed from the default random model of the APC, *C*, by assuming that each of the aligned columns $c_{j}$ is independent and identically distributed (i.i.d.), as shown below:

$$e=N\cdot \prod _{j=1}^{n}Pr\left({c}_{j}\right)$$

where *N* is the length of the input sequence and each of the aligned columns $c_{j}\in C$ is i.i.d. To compute the default probability of an aligned column, $Pr\left({c}_{j}\right)$, sum the probabilities of all the possible amino acids in that single aligned column: $Pr\left({c}_{j}\right)=Pr\left({c}_{j}={\sigma}_{1}\right)+Pr\left({c}_{j}={\sigma}_{2}\right)+\cdots +Pr\left({c}_{j}={\sigma}_{k}\right)=\sum _{\forall {\sigma}_{k}\in {c}_{j}}Pr\left({c}_{j}={\sigma}_{k}\right)$, where $Pr\left({c}_{j}={\sigma}_{k}\right)=\frac{1}{20}$ for each $\sigma_{k}$, which is i.i.d. Returning to the text example with 6 patterns and 6 aligned columns and the English alphabet, $Pr\left({c}_{1}\right)=Pr\left({c}_{1}=B\right)+Pr\left({c}_{1}=M\right)+Pr\left({c}_{1}=H\right)=\frac{3}{26}$. Therefore, the final expectation is

$$e=N\cdot \prod _{j=1}^{6}Pr\left({c}_{j}\right)$$
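The random model described above can be sketched numerically. The code assumes the common form of the standardized residual, (o − e)/√e, and an expectation e equal to N times the product of the column probabilities; it also assumes that gaps and wildcards are skipped when counting the distinct symbols of a column. The function names and the value N = 40000 are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def column_probability(column: str, alphabet_size: int) -> float:
    """Sum 1/|alphabet| over the distinct symbols seen in one aligned column.

    Gaps ('-') and wildcards ('*') are skipped, which is an assumption.
    """
    symbols = {s for s in column if s not in "-*"}
    return len(symbols) / alphabet_size


def expected_occurrences(apc: List[str], total_length: int, alphabet_size: int) -> float:
    """e = N * product of Pr(c_j), treating the aligned columns as independent."""
    expected = float(total_length)
    for column in zip(*apc):
        expected *= column_probability("".join(column), alphabet_size)
    return expected


def standard_residual(observed: int, expected: float) -> float:
    """(o - e) / sqrt(e), the assumed form of the standardized residual."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) / math.sqrt(expected)


# First aligned column of the text example over the 26-letter English alphabet.
pr_c1 = column_probability("BMHBBH", 26)

# Single-pattern APC of length 5 over 20 amino acids, with a hypothetical N.
e = expected_occurrences(["LFEYL"], total_length=40000, alphabet_size=20)
sr = standard_residual(observed=20, expected=e)
print(pr_c1, e, sr)
```

Because the expectation shrinks by a factor of the alphabet size with every conserved column, long APCs with many occurrences obtain very large residuals under this model.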

#### The redundancy measure of the aligned columns

The redundancy measure $R_{1}\left({c}_{j}\right)$ for the aligned column $c_{j}$ is

$$R_{1}\left({c}_{j}\right)=1-H\left({c}_{j}\right)$$

where $H\left({c}_{j}\right)$ is the entropy of the aligned column, with $Pr\left({c}_{j}=\sigma \right)$ computed by counting *σ* in the aligned column $c_{j}$ over the induced data, $\mathbb{D}\left({C}^{l}\right)$. Hence, a conserved aligned column has $R_{1}\left({c}_{j}\right)=1$, since its entropy takes the minimum value $H\left({c}_{j}\right)=0$. Similarly, a variable aligned column has $R_{1}\left({c}_{j}\right)=0$, with the maximum entropy value $H\left({c}_{j}\right)=1$, if the amino acid occurrences in $\mathbb{D}\left({C}^{l}\right)$ are equiprobable.
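The two limiting cases of the redundancy measure can be checked with a small sketch. It assumes that the redundancy is one minus the column entropy and that the entropy is normalized by the logarithm of the alphabet size so that it lies between zero and one; the text only states the limiting values, so this normalization and the function name `redundancy` are assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def redundancy(column: Sequence[str], alphabet_size: int) -> float:
    """One minus the entropy of a column, normalized by log(alphabet_size)."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(alphabet_size)


# A fully conserved column, as observed for a conserved lysine (K).
conserved = redundancy("KKKKKKKK", alphabet_size=20)

# Equiprobable symbols over a hypothetical 4-letter alphabet.
uniform = redundancy("ACDE", alphabet_size=4)
print(conserved, uniform)
```

Here the column symbols stand for the residues observed at one aligned position across the induced data, not across the patterns of the APC.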

Note that the entropy of the Redundancy Measure is computed from the induced data, $\mathbb{D}\left({C}^{l}\right)$, whereas the entropy of the APC Quality uses the amino acids from the patterns in ${\mathbb{P}}^{l}$. This is because the quality of the APC measures how much variation or stability is in the patterns, whereas the redundancy of the aligned column measures how much redundancy or consistency is in the induced data.

## Results and discussions

We applied our APC Process on the cytochrome c and the ubiquitin protein families in order to examine how the resulting APCs relate to the binding sites, which are the biologically significant regions of the protein. There are three aspects we would like to explore: the reduction of the set of candidate solutions from the discovered patterns to APCs obtained; how each pattern in the APC surrounding the binding site represents a binding segment in a single strand of protein; and how binding residues correlate to their aligned column. Finally, we display our results underneath the pFam multiple sequence alignment to compare the differences in the representations. In the comparison, we demonstrate the overall hierarchical clustering performance of our APC Process as well as the quality of the resulting APCs.

### Cytochrome C results

First, we demonstrated that by grouping similar patterns together, the APC reduces the number of candidate solutions to be examined without losing information. Next, we showed that in the binding APCs, each pattern represents a binding segment in the protein sequence and each of the two binding sites is represented by a specific aligned column.

The 317 sequences from the cytochrome c protein family were obtained on September 17th, 2012 from UniProt by searching the following terms: cytochrome c; AND reviewed:yes; AND name:c*; AND mnemonic:c*; AND (name:cytochrome AND name:c); NOT name:type; NOT name:VPR; NOT name:biogenesis; NOT name:*ase; NOT (name:cytochrome AND name:b*); NOT like; NOT proba*; AND fragment:no; AND active:yes. These selected parameters should help to yield a reasonable number of input sequences for the APC Process.

From these 317 input sequences, the PD Step was executed with the *minimal order* of 5, the *minimum occurrence* of 20, and the *delta* of 0.9. The PD Step discovered 154 patterns from the cytochrome c protein family, where 28 patterns, or 18.18% of the total patterns, contain the proximal binding site, His18, and 23 patterns, or 14.94% of the total patterns, contain the distal binding site, Met62, resulting in a combined total of 33.12% of the discovered patterns that contain one of the two binding sites. Therefore, the set of patterns redundantly covers the two binding sites. This observation indicates that each individual pattern alone covers only a small fraction of the input sequences in the data space; therefore, a single pattern by itself cannot fully represent the rich variations of all the input sequences within the entire protein family. Hence, the APC, which contains a set of similar patterns that have been grouped and aligned to allow variations, provides a reduced and much richer representation of the binding segments and binding residues.

In the APC Step, we showed that our APC Process reduced the number of candidate solutions without losing any information and richly captured the binding sites in the compact APCs, where the binding segments are the patterns therein and the binding sites are the conserved aligned columns. We ensured that all the patterns discovered are strongly statistically significant by starting with a tight configuration. From this list of 154 statistically significant and non-redundant patterns obtained from the previous PD Step, the APC Step was executed with the following settings: the MERGE *Algorithm* as Global Alignment, the SIMILARITY *Score* as Hamming Distance, the TERMINATION *Condition* as a score less than 0.8, the heuristic column distribution score greater than 0.8, and a minimum of three overlapping column matches.
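The Hamming distance named in these settings can be illustrated for two aligned patterns of equal length. The snippet is a sketch only: the normalization of the distance into a similarity score between zero and one, and the treatment of wildcards as ordinary symbols, are assumptions, since the exact scoring used by the APC Step is not spelled out here. The function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity(a: str, b: str) -> float:
    """Fraction of matching positions; this normalization is an assumption."""
    return 1.0 - hamming_distance(a, b) / len(a)


# Two aligned patterns from the text example ('*' treated as an ordinary symbol).
distance = hamming_distance("HELLO*", "MELLOW")
score = similarity("HELLO*", "MELLOW")
print(distance, round(score, 3))
```

Under this normalization, a threshold such as 0.8 would allow a merge only when at least four out of every five aligned positions agree.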

The 36 APCs of the Cytochrome C Family Ranked by Standard Residual (where *m* = the number of patterns in the APC, and *n* = length of the APC).

Rank | APC (as regular expressions) | m | n | Quality | Coverage | Standard Residual | Binding Site |
---|---|---|---|---|---|---|---|
1 | WGEDTLMEYLENPKKYIPGTK | 8 | 30 | 0.57 | 81 | 5.92E+16 | Met62 |
2 | MGDVEKGKKIFVQ[KR]CAQC | 19 | 33 | 0.43 | 119 | 5.04E+16 | His18 |
3 | QC | 7 | 28 | 0.41 | 46 | 8.32E+14 | His18 |
4 | TLYDYLLNPKKYIPGTK | 8 | 27 | 0.44 | 116 | 1.91E+14 | Met62 |
5 | GAGHK[QVT]GPNL[NH]GLFGRQSGTT | 13 | 21 | 0.4 | 125 | 3.53E+10 | |
6 | GFSYTDANKNKGITWGE | 8 | 17 | 0.41 | 66 | 6.33E+08 | |
7 | GEKIFKTKCAQC | 3 | 15 | 0.57 | 24 | 6.45E+07 | His18 |
8 | MGDVEKGKKIFVQKC | 7 | 15 | 0.4 | 53 | 5.04E+07 | |
9 | GPNLHGLFGRKTGQA | 4 | 15 | 0.43 | 46 | 4.37E+07 | |
10 | ERADLIAYLK[KE]ATNE | 9 | 15 | 0.4 | 91 | 3.53E+07 | |
11 | HGLFGRKTGQAPGF | 9 | 14 | 0.46 | 70 | 2.10E+07 | |
12 | IPGTK | 4 | 13 | 0.42 | 136 | 9.06E+06 | Met62 |
13 | AANKNKGITWGE | 4 | 12 | 0.5 | 54 | 1.60E+06 | |
14 | LHGLFGR[QK]SGTT | 6 | 12 | 0.42 | 88 | 1.07E+06 | |
15 | AGYSYSAANKN | 5 | 11 | 0.43 | 30 | 1.40E+05 | |
16 | TLYDYLLNP | 2 | 9 | 0.56 | 29 | 2.69E+04 | |
17 | GQAPGFSY | 2 | 8 | 0.5 | 27 | 5.57E+03 | |
18 | TK | 2 | 7 | 0.57 | 52 | 3.38E+03 | Met62 |
19 | GGKHKTG | 2 | 7 | 0.43 | 64 | 2.94E+03 | |
20 | EKGKKIF | 2 | 7 | 0.43 | 62 | 2.85E+03 | |
21 | FAGLKKP | 3 | 7 | 0.48 | 57 | 2.62E+03 | |
22 | WGGGKIY | 2 | 7 | 0.71 | 27 | 2.48E+03 | |
23 | FAGIKKK | 2 | 7 | 0.43 | 51 | 2.34E+03 | |
24 | YLKKAT | 1 | 6 | 1 | 29 | 1.19E+03 | |
25 | WGEDTL | 1 | 6 | 1 | 25 | 1.02E+03 | |
26 | NCAAC | 2 | 6 | 0.83 | 30 | 8.68E+02 | His18 |
27 | KGAGHK | 2 | 6 | 0.83 | 26 | 7.52E+02 | |
28 | KGITW | 1 | 5 | 1 | 49 | 4.46E+02 | |
29 | GFSYT | 1 | 5 | 1 | 42 | 3.83E+02 | |
30 | FVQKC | 1 | 5 | 1 | 39 | 3.55E+02 | |
31 | DANKN | 1 | 5 | 1 | 34 | 3.10E+02 | |
32 | GYSYT | 1 | 5 | 1 | 28 | 2.55E+02 | |
33 | A | 1 | 5 | 1 | 24 | 2.19E+02 | Met62 |
34 | C | 1 | 5 | 1 | 22 | 2.00E+02 | His18 |
35 | FKTRC | 1 | 5 | 1 | 20 | 1.82E+02 | |
36 | LFEYL | 1 | 5 | 1 | 20 | 1.82E+02 | |

Comparing the Number of APCs and Patterns.

Binding Site | Patterns Count | %overall | APCs Count | %overall | %Reduction |
---|---|---|---|---|---|
His18 | 28 | 18.18% | 5 | 13.89% | 82.14% |
Met62 | 23 | 14.94% | 5 | 13.89% | 78.26% |
Total | 154 | 33.12% | 36 | 27.78% | 76.62% |
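The reduction percentages in this table follow from the pattern and APC counts. As a small worked check, assuming that the reduction is the relative decrease from the number of patterns to the number of APCs (the helper name `percent_reduction` is illustrative):

```python
def percent_reduction(pattern_count: int, apc_count: int) -> float:
    """Relative decrease, in percent, from the pattern count to the APC count."""
    return 100.0 * (pattern_count - apc_count) / pattern_count


# (pattern count, APC count) per row of the table above.
rows = {"His18": (28, 5), "Met62": (23, 5), "Total": (154, 36)}
reductions = {name: round(percent_reduction(p, a), 2) for name, (p, a) in rows.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```

The same arithmetic gives the overall shares, for example 28 of 154 patterns (18.18%) and 5 of 36 APCs (13.89%) for His18.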

Comparing the Top Four APCs and their Patterns.

Binding Site | Pattern Count | APCs Count | %Reduction |
---|---|---|---|
His18 | 26 | 2 | 92.31% |
Met62 | 16 | 2 | 87.50% |
Total | 42 | 4 | 90.48% |

### Cytochrome C discussion

Figure 4: One three-dimensional structure from the cytochrome c protein family, PDB ID 1F1F, is displayed. The top-two statistically significant APCs from the cytochrome c protein are the proximal binding segment (in pink) and the distal binding segment (in blue) that bind the heme from above and below the horizontal plane, respectively. More specifically, one specific amino acid from each of the two segments binds the iron molecule from the centre of the heme: the "H" (Histidine) residue at position 18 of the proximal segment and the "M" (Methionine) residue at position 62 of the distal segment.

The Distal APC of the Cytochrome C Family.

Patterns | Count | Score |
---|---|---|
WGEDTLMEYLENPKKYIPGTKMIF****** | 22 | 1.94E+03 |
***DTLMEYLENPKKYIPGTKM******** | 26 | 1.30E+03 |
*******EYLENPKKYIPGTKMIFAGIKK* | 35 | 2.54E+02 |
****TLMEYLENPKKYIPGTKMIFAGIKKK | 29 | 7.34E+02 |
****TLMEYLENPKKYIPGTKMIFAG**** | 34 | 4.81E+01 |
********YLENPKKYIPGTKM******** | 81 | 6.51E+02 |
*******EYLENPKKYIPGTKMIFAG**** | 42 | 5.44E+01 |
*******EYLENPKKYIPGTKM******** | 65 | 2.88E+01 |

The Proximal APC of the Cytochrome C Family.

Patterns | Count | Score |
---|---|---|
******GKKIFVQKCAQCHTV********* | 23 | 6.27E+04 |
****EKGKKIFVQKCAQCHT********** | 23 | 1.32E+04 |
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTG | 20 | 7.50E+07 |
******GKKIFVQKCAQCHTVEKGGKHKTG | 20 | 1.16E+06 |
*************KCAQCH*********** | 57 | 1.59E+01 |
**************CAQCH*********** | 89 | 2.58E+03 |
*************RCAQCHT********** | 21 | 1.38E+01 |
**************CAQCHT********** | 76 | 3.01E+01 |
**********FVQKCAQCHTVE******** | 27 | 5.88E+02 |
************QKCAQCHT********** | 32 | 6.38E+01 |
************QKCAQCHTVEKGGKHKTG | 23 | 6.33E+04 |
*************KCAQCHTVEKG****** | 30 | 4.91E+01 |
*************KCAQCHTV********* | 51 | 1.73E+01 |
**************CAQCHTV********* | 65 | 3.10E+01 |
**************CAQCHTVEK******* | 34 | 1.30E+01 |
**************CAQCHTVE******** | 49 | 2.41E+01 |
****************QCHTV********* | 95 | 2.33E+03 |
****************QCHTVEKGG***** | 45 | 1.75E+01 |
****************QCHTVE******** | 77 | 3.15E+01 |

In conclusion, the APC can represent protein functions such as the binding segments and binding residues, presents a reduced set of candidate solutions, and specifies their locations in the protein family. In cytochrome c, the prevention of binding can block cancer progression, which makes it an important avenue of drug discovery for cancer treatment.

### Ubiquitin results

To further study the APC Step, we closely examined the iterative steps and their resulting APCs using the ubiquitin protein family. The 70 sequences from the ubiquitin protein family used in our experiment were obtained on August 9th, 2012 from UniProt by searching the following terms: name:ubiquitin; NOT name:*ase; NOT name:like; NOT name:ribosomal; NOT name:modifier; NOT name:factor; NOT name:protein; NOT name:conjugating; NOT name:activating; NOT name:enzyme; AND reviewed:yes; AND mnemonic:UB*.

Figure 5: Ten resulting APCs representing the proximal and distal binding segments of the cytochrome c are compared to the HMM logo from pFam. In the largest APC, Cys17 is identified as one of the conserved aligned columns, where His18 binds to the heme iron. In the second largest APC, Met62 is identified as one of the conserved aligned columns of the distal binding segment, where Met62 binds the heme iron.

From these 70 input sequences, the PD Step was executed with the *minimal order* of 10, the *minimum occurrence* of 20, and the *delta* of 0.9 to yield a proper size of the results for the study. Table 8 shows the thirty discovered patterns, where all except five of the patterns contained at least one of the seven binding residues. Nevertheless, these five patterns still corresponded to the conserved amino acids around the binding residues. Therefore, all the discovered patterns indicate important functionality in the ubiquitin protein family, such as the binding site or the areas next to the binding site. Once again, each pattern on its own occurs only a few times, and has only a low frequency count for representing the binding segments of this protein family. Since protein binding segments exhibit considerable variability, APCs represent the protein family's functional binding sites more explicitly and effectively.

Statistically Ranked Patterns Discovered from the Sequences of the ubiquitin Family.

Ranking | Pattern | Frequency | Score | Binding Residue |
---|---|---|---|---|
1 | MQIFV QD IQ | 21 | 5.44E+44 | Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, Lys63 |
2 | MQIFV QD IQ | 15 | 2.86E+44 | Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, Lys63 |
3 | SDTIENV LEDGRTLSDYNIQ | 24 | 1.25E+33 | Lys27, Lys29, Lys33, Lys48, Lys63 |
4 | SDTIDNV LEDGRTLADYNIQ | 17 | 7.59E+32 | Lys27, Lys29, Lys33, Lys48, Lys63 |
5 | MQIFV QD | 17 | 4.76E+31 | Lys6, Lys11, Lys27, Lys29, Lys33, Lys48 |
6 | IENV EDGRTLSDYNIQ | 32 | 3.48E+31 | Lys27, Lys29, Lys33, Lys48, Lys63 |
7 | V | 17 | 1.59E+30 | Lys6, Lys11, Lys27, Lys29, Lys33, Lys48 |
8 | TITLEVEPSDTIENV QQRLIFAG | 24 | 8.80E+28 | Lys27, Lys29, Lys33, Lys48 |
9 | SDYNIQ | 39 | 7.43E+27 | Lys29, Lys33, Lys48, Lys63 |
10 | NIQ | 44 | 3.66E+23 | Lys33, Lys48, Lys63 |
11 | IPPDQQRLIFAG | 20 | 3.38E+23 | Lys48, Lys63 |
12 | NV DGRTLSDYNI | 36 | 6.15E+21 | Lys27, Lys29, Lys33, Lys48 |
13 | LSDYN | 44 | 5.20E+18 | Lys29, Lys33, Lys48 |
14 | LAD | 19 | 2.23E+16 | Lys29, Lys33, Lys48 |
15 | | 19 | 8.01E+15 | Lys6, Lys11, Lys27, Lys29, Lys33 |
16 | MQIFV | 25 | 1.17E+15 | Lys6, Lys11, Lys27 |
17 | MQIFV | 23 | 8.48098E+14 | Lys6, Lys11, Lys27 |
18 | DYNIQ | 62 | 2.40964E+11 | Lys63 |
19 | MQIFV | 60 | 17382565255 | Lys6, Lys11 |
20 | | 26 | 1135719784 | Lys6, Lys11 |
21 | LEVESSDTIDNV | 26 | 7757459.08 | Lys27 |
22 | TITLEVEPS | 28 | 28304.96142 | |
23 | | 67 | 3796.714675 | Lys6, Lys11 |
24 | DGRTLAD | 23 | 1298.702247 | |
25 | STLHL | 69 | 1102.599421 | |
26 | | 67 | 315.8836468 | Lys11 |
27 | IENV | 38 | 309.1891137 | Lys27 |
28 | VEPSD | 28 | 260.0761993 | |
29 | TLADY | 23 | 191.1286116 | |
30 | IDNV | 29 | 180.0682775 | Lys27 |

*Algorithm* as Global Alignment, the SIMILARITY *Score* as Hamming Distance, the TERMINATION *Condition* as a Score less than 0.3, the heuristics column distribution score greater than 0.3 and the minimum of three overlapping column matches. We demonstrated the efficacy of our APC Process by showing the reduced set of 9 APCs and their binding sites (Table 9).
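The similarity and termination settings named above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the Hamming-based score is normalized to the fraction of matching positions between two patterns that are already aligned to equal length, and that merging stops once this score drops below 0.3. The function names `hamming_similarity` and `should_merge` are hypothetical, and the normalization actually used in the APC Process may differ.

```python
TERMINATION_THRESHOLD = 0.3  # termination value quoted in the text


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length aligned patterns agree.

    Assumption: this normalization is illustrative, not the paper's definition.
    """
    if len(a) != len(b):
        raise ValueError("patterns must already be aligned to equal length")
    if not a:
        raise ValueError("patterns must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))  # plain Hamming distance
    return 1.0 - mismatches / len(a)


def should_merge(a: str, b: str, threshold: float = TERMINATION_THRESHOLD) -> bool:
    """Keep merging while the similarity has not fallen below the threshold."""
    return hamming_similarity(a, b) >= threshold


# Two patterns from Table 8 that differ in a single residue (E versus D).
print(should_merge("SDTIENV", "SDTIDNV"))  # True
# Two unrelated five-residue fragments share no position.
print(should_merge("STLHL", "IENVK"))      # False
```

Under these assumptions, near-identical patterns such as SDTIENV and SDTIDNV fall into the same cluster, while unrelated fragments stay apart.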

Table 9: The 9 APCs of the ubiquitin Family Ranked by Standard Residual (where *m* = the number of patterns in the APC, and *n* = the length of the APC).

Ranking | APC (as regular expressions) | m | n | Quality | Coverage | Standard Residual | Binding Site |
---|---|---|---|---|---|---|---|
1 | MQIFVKTLTGKTITLEVE[SP]SDTI[DE]NVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTL[SA]DYNIQKESTLHLVLRLRGG | 10 | 76 | 0.31 | 61 | 4.7E+39 | Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, Lys63 |
2 | NVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTL[SA]DYNIQKESTLHLVLRLRGG | 5 | 52 | 0.5 | 67 | 3.3E+29 | Lys27, Lys29, Lys33, Lys48, Lys63 |
3 | MQIFVKTLTGKTITLEVEP[SP]DTI[ED]NVK | 5 | 27 | 0.34 | 67 | 2.7E+14 | Lys6, Lys11, Lys27 |
4 | DYNIQKESTLHLVLRLRGG | 1 | 19 | 1 | 62 | 2.2E+12 | Lys63 |
5 | LEVE[SP]SDTIDNVK | 2 | 13 | 0.31 | 54 | 1.0E+07 | Lys27 |
6 | KTITLEVEPS | 2 | 10 | 0.4 | 68 | 4.0E+05 | Lys11, Lys27 |
7 | DGRTLADY | 2 | 8 | 0.5 | 24 | 1.4E+04 | |
8 | STLHL | 1 | 5 | 1 | 69 | 1.7E+03 | |
9 | I[ED]NVK | 2 | 5 | 0.8 | 67 | 1.2E+03 | Lys27 |
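Because Table 9 writes each APC as a regular expression, the binding residues it covers can be checked directly against a sequence. The sketch below is an illustration only: it uses the 76-residue human ubiquitin sequence as an assumed representative family member, takes the expressions for APCs 2, 3, 4 and 9 from Table 9 with their wrapped fragments joined, and introduces a hypothetical helper `lysines_covered`.

```python
import re

# Human ubiquitin (76 residues), assumed here as a representative family member.
UBIQUITIN = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ"
    "QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)

# APC regular expressions copied from Table 9 (ranking -> expression).
APCS = {
    2: "NVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTL[SA]DYNIQKESTLHLVLRLRGG",
    3: "MQIFVKTLTGKTITLEVEP[SP]DTI[ED]NVK",
    4: "DYNIQKESTLHLVLRLRGG",
    9: "I[ED]NVK",
}


def lysines_covered(apc_regex, sequence=UBIQUITIN):
    """Return the 1-based positions of lysines inside the first APC match.

    Returns None when the APC does not occur in the sequence.
    """
    match = re.search(apc_regex, sequence)
    if match is None:
        return None
    return [i + 1 for i in range(match.start(), match.end()) if sequence[i] == "K"]


for rank, regex in APCS.items():
    print(rank, lysines_covered(regex))
# 2 [27, 29, 33, 48, 63]
# 3 [6, 11, 27]
# 4 [63]
# 9 [27]
```

For these four APCs the lysine positions recovered from the match agree with the Binding Site column of Table 9.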

### Ubiquitin discussion

Figure 7: The three-dimensional structure of the ubiquitin protein, with PDB ID 1UBQ from the protein data bank, has seven binding residues: Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, and Lys63.

correspond to six of the seven binding residues (Lys6, Lys11, Lys27, Lys33, Lys48, and Lys63). The remaining Lys33 is found in an APC with only one pattern and thus stands out as a significant functional group with a distinct pattern discovered with high statistical significance in the PD Step.

For ubiquitin, our APCs are pattern alignments that agree with the emission probabilities of the pFam profile HMM (Figure 6). All eight APCs discovered agreed with the pFam HMM emission probability. Surprisingly, our results differ from PROSITE's consensus motif (PDOC00271), which missed 172 ubiquitin proteins. In drug discovery, preventing the linking of ubiquitin to its binding proteins via its binding site inhibits cancer growth.
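The idea of a conserved aligned column, reported for these APCs as an entropy redundancy measure of 1.0, can be illustrated with a short sketch. It assumes that the redundancy of a column is one minus its Shannon entropy divided by the maximum entropy of a 20-letter amino acid alphabet; the exact normalization behind the reported measure is not restated in this section, so this definition and the function names are assumptions. The two input patterns are the two expansions of APC 9 (`I[ED]NVK`).

```python
import math
from collections import Counter


def column_redundancy(column, alphabet_size=20):
    """Return 1 - H / H_max for one aligned column.

    Assumption: H_max = log(alphabet_size); the paper's normalization may differ.
    A column holding a single residue type has H = 0 and therefore redundancy 1.0.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(alphabet_size)


def conserved_columns(patterns, tol=1e-9):
    """Return 0-based indices of aligned columns whose redundancy is 1.0."""
    columns = zip(*patterns)  # transpose the rows (patterns) into columns
    return [j for j, col in enumerate(columns) if column_redundancy(col) >= 1.0 - tol]


# The two patterns represented by APC 9, aligned without gaps.
patterns = ["IENVK", "IDNVK"]
print(conserved_columns(patterns))  # [0, 2, 3, 4]
```

In this toy alignment only the second column varies (E or D); the last column, which holds the lysine, is fully conserved.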

## Conclusion

Our APC Process greatly reduces the number of APCs in comparison with other methods. This is because the APC Step starts with input patterns from the PD Step rather than the entire input search space. This drastically reduces the search space in a controlled manner. From the application aspect, using data from two Uniprot protein families (cytochrome c and ubiquitin), the majority of top-ranking APCs corresponded to their protein binding segments. The resulting cytochrome c binding APCs agree with the pFam emission probability. An APC represents a set of patterns as the horizontal rows and its aligned columns as the vertical columns, which can be further evaluated for amino acid conservations. In fact, for cytochrome c, the proximal and distal binding residues correspond to conserved aligned columns with R1 of 1.0. In addition, the distal APC identifies one conserved aligned column with R1 of 1.0 as the binding residue, which is not identified in PROSITE or pFam. While the ubiquitin APCs agree with pFam emission probability, six of the seven binding residues are successfully identified in the APC.

In conclusion, APCs can be used to reveal functional domains across different protein families without relying on prior knowledge or clues about the consensus regions. Currently, we are using aligned column variations as amino acid characteristics to classify protein species and gene labels. We are also extending the algorithm to discover interdependencies within APCs and long-distance associations among APCs. In more general cases of protein analysis, the function and the nature of the protein function are not clear; thus, the capability that overcomes such difficulties marks the uniqueness and novelty of our APC Process. In the broader sense, this knowledge is essential for understanding the proteins involved in epigenetics for drug discovery [38]. The development of cancer generally increases with age, and with the ageing baby-boomer population it is crucial for drug companies to find cost-effective and time-saving techniques for drug discovery.

## Declarations

### Acknowledgements

The authors wish to thank their colleagues C. M. Li, D. Yuen, and members of graduate small group for reading this manuscript. This research is supported by an NSERC Post Graduate Scholarship and an NSERC Discovery Grant.

**Declarations**

The publication costs for this article were funded by the corresponding author.

This article has been published as part of *Proteome Science* Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.

## References

1. Peto J: **Cancer epidemiology in the last century and the next decade.** *Nature* 2001, **411**(6835):390–395. doi:10.1038/35077256
2. Forman MS, Trojanowski JQ, Lee VMY: **Neurodegenerative diseases: a decade of discoveries paves the way for therapeutic breakthroughs.** *Nature Medicine* 2004, **10**(10):1055–1063.
3. Hughes J, Rees S, Kalindjian S, Philpott K: **Principles of early drug discovery.** *Br J Pharmacol* 2011, **162**(6):1239–1249. doi:10.1111/j.1476-5381.2010.01127.x
4. Colon W, Wakem LP, Sherman F, Roder H: **Identification of the Predominant Non-Native Histidine Ligand in Unfolded Cytochrome c.** *Biochemistry* 1997, **36**:12535–12541. doi:10.1021/bi971697c
5. Martinou JC, Desagher S, Antonsson B: **Cytochrome c release from mitochondria: all or nothing.** *Nature Cell Biology* 2000, **2**:E41-E43. doi:10.1038/35004069
6. Hoeller D, Dikic I: **Targeting the ubiquitin system in cancer therapy.** *Nature* 2009, **458**(7237):438–444. doi:10.1038/nature07960
7. Butenko S, Chaovalitwongse WA, Pardalos PM: *Clustering Challenges in Biological Networks*. World Scientific, illustrated edition; 2009.
8. Brejová B, Vinar T, Li M: **Pattern discovery: Methods and software.** *Introduction to Bioinformatics* 2003, 491–522.
9. Thompson JD, Higgins DG, Gibson TJ: **CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.** *Nucleic Acids Res* 1994, **22**(22):4673–80. doi:10.1093/nar/22.22.4673
10. Notredame C, Higgins DG, Heringa J: **T-Coffee: A novel method for fast and accurate multiple sequence alignment.** *J Mol Biol* 2000, **302**(1):205–17. doi:10.1006/jmbi.2000.4042
11. Subramanian AR, Kaufmann M, Morgenstern B: **DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment.** *Algorithms Mol Biol* 2008, **3**:6.
12. Durbin R, Eddy SR, Krogh A, Mitchison G: *Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids*. Cambridge University Press; 1998.
13. Wang L, Jiang T: **On the complexity of multiple sequence alignment.** *Journal of Computational Biology* 1994, **1**(4):337–348. doi:10.1089/cmb.1994.1.337
14. Frith MC, Hansen U, Spouge JL, Weng ZP: **Finding functional sequence elements by multiple local alignment.** *Nucleic Acids Res* 2004, **32**(1):189–200. doi:10.1093/nar/gkh169
15. Buhler J, Tompa M: **Finding motifs using random projections.** *J Comput Biol* 2002.
16. Kawaji H, Takenaka Y, Matsuda H: **Graph-based clustering for finding distant relationships in a large set of protein sequences.** *Bioinformatics* 2004, **20**:243–252. doi:10.1093/bioinformatics/btg397
17. Patwardhan R, Tang H, Kim S, Dalkilic M: **An Approximate de Bruijn Graph Approach to Multiple Local Alignment and Motif Discovery in Protein Sequences.** *Data Mining and Bioinformatics* 2006, **4316**:158–169. doi:10.1007/11960669_14
18. Pevzner P, Sze S: **Combinatorial approaches to finding subtle signals in DNA strings.** *In Proc ISMB* 2000, **2000**:269–278.
19. Mandoiu I, Zelikovsky A: *Bioinformatics Algorithms: Techniques and Applications*. Wiley Series in Bioinformatics, Wiley; 2008.
20. Li M, Ma B, Wang L: **Finding similar regions in many strings.** *Journal of Computer and System Sciences* 2002, **65**:73–96. doi:10.1006/jcss.2002.1823
21. Evans PA, Smith A, Wareham HT: **On the complexity of finding common approximate substrings.** *Theoretical Computer Science* 2003, **306**(3):407–430.
22. Aleksandrushkina NI, Egorova LA: **Nucleotide makeup of the DNA of thermophilic bacteria of the genus Thermus.** *Mikrobiologiia* 1978, **47**(2):250–252.
23. Jeong JC, Lin X, Chen XW: **On Position-Specific Scoring Matrix for Protein Function Prediction.** *IEEE/ACM Transactions on Computational Biology and Bioinformatics* 2011, **8**(2):308–315.
24. Chan SC, Wong AKC: **Synthesis and Recognition of Sequences.** *IEEE Trans on PAMI* 1991, **13**(12):1245–1255. doi:10.1109/34.106998
25. Wong AKC, Chiu DKY, Chan SC: **Pattern Detection in Biomolecules Using Synthesized Random Sequence.** *Journal of Pattern Recognition* 1995, **29**(9):1581–1586.
26. Sonnhammer EL, Eddy SR, Durbin R: **Pfam: A Comprehensive Database of Protein Domain Families Based on Seed Alignments.** *PROTEINS: Structure, Function, and Genetics* 1997, **28**:405–420. doi:10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
27. Lee ESA, Wong AKC: **Identifying Protein Binding Functionality of Protein Families by Aligned Pattern Clusters.** *IEEE International Conference on Bioinformatics and Biomedicine* 2012.
28. Wong AK, Zhuang D, Li GC, Lee ESA: **Discovery of Delta Closed Patterns and Non-induced Patterns from Sequences.** *IEEE Transactions on Knowledge and Data Engineering Journal* 2012, **24**(8):1408–1421.
29. Wong AKC, Liu TS, Wang CC: **Statistical Analysis of Residue Variability in Cytochrome C.** *Journal of Molecular Biology* 1976, **102**:287–295. doi:10.1016/S0022-2836(76)80054-X
30. Kranz R, Richard-Fogal C, Taylor J, Frawley E: **Cytochrome c biogenesis: mechanisms for covalent modifications and trafficking of heme and for heme-iron redox control.** *Microbiology and Molecular Biology Reviews* 2009, **73**(3):510–528. doi:10.1128/MMBR.00001-09
31. Stevens J, Daltrop O, Allen J, Ferguson S: **C-type cytochrome formation: chemical and biological enigmas.** *Accounts of Chemical Research* 2004, **37**(12):999–1007. doi:10.1021/ar030266l
32. Bairoch A: **PROSITE: a dictionary of sites and patterns in proteins.** *Nucleic Acids Research* 1991, **19**:2241–2245. doi:10.1093/nar/19.suppl.2241
33. Sigrist CJ, Cerutti L, Castro ED, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N: **PROSITE, a protein domain database for functional characterization and annotation.** *Nucleic Acids Research* 2010, **38**(suppl 1):D161-D166.
34. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A: **The Pfam protein families database.** *Nucleic Acids Research* 2010, **38**:D211–22. doi:10.1093/nar/gkp985
35. Peng J, Schwartz D, Elias JE, Thoreen CC, Cheng D, Marsischky G, Roelofs J, Finley D, Gygi SP: **A proteomics approach to understanding protein ubiquitination.** *Nature Biotechnology* 2003, **21**(8):921–926. doi:10.1038/nbt849
36. Xu P, Peng J: **Characterization of Polyubiquitin Chain Structure by Middle-down Mass Spectrometry.** *Analytical Chemistry* 2008, **80**(9):3438–44. doi:10.1021/ac800016w
37. Ikeda F, Dikic I: **Atypical ubiquitin chains: new molecular signals.** *EMBO Reports* 2008, **9**(6):536–542. doi:10.1038/embor.2008.93
38. Arrowsmith CH, Bountra C, Fish PV, Lee K, Schapira M: **Epigenetic protein families: a new frontier for drug discovery.** *Nature Reviews Drug Discovery* 2012, **11**:384–400. doi:10.1038/nrd3674

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.