Finding a peptide whose theoretical spectrum has a maximum match to a measured experimental spectrum.

#### De novo peptide sequencing method

There are two main approaches to the peptide sequencing problem: database search and the *de novo* method [57, 60]. The former generates all 20^{*l*} amino acid sequences of a given length *l*, together with the theoretical spectrum of each sequence, and finds the best match among all these spectra [61–63]. Because the number of possible sequences grows exponentially with the peptide length, the computing time also grows exponentially. *De novo* sequencing, which usually uses a spectrum graph model, does not need to generate all the amino acid sequences; it has developed quickly and drawn increasing attention in recent years [64–66]. Here, we introduce the basic models and principles of this kind of method [5, 65]. Some recent improvements and advanced approaches can be found in [67–70].
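To make the exponential growth concrete, the following minimal sketch counts and lazily enumerates the candidate sequences of length *l*. The names `ALPHABET`, `count_candidates` and `enumerate_candidates` are illustrative assumptions, not part of any cited method.

```python
from itertools import product

# One-letter codes of the 20 standard amino acids.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def count_candidates(length):
    """Number of amino acid sequences of the given length (20 ** length)."""
    return len(ALPHABET) ** length

def enumerate_candidates(length):
    """Lazily yield every sequence of the given length."""
    return ("".join(p) for p in product(ALPHABET, repeat=length))

for l in (2, 5, 10):
    print(l, count_candidates(l))
```

Even at *l* = 10 the count already exceeds 10^13, which is why exhaustive enumeration quickly becomes impractical.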

In this method, a spectrum graph representing the experimental spectrum is constructed. Assume that the experimental spectrum *S* = {*s*_{1}, …, *s*_{q}} consists of *N*–terminal ions. We ignore *C*–terminal ions here because a similar model can be built for them by replacing *N*–terminal ions with *C*–terminal ions. Every mass *s*_{t} ∈ *S* (*t* = 1, 2, …, *q*) may have been created from a partial peptide by one of *k* different ion types. In other words, each *s*_{t} is the mass of an ion derived from some partial peptide *P*_{i} (*i* = 1, 2, …, *n*) by losing some small group *δ*_{j} (*j* = 1, 2, …, *k*). However, we do not know which ion type in ∆ = {*δ*_{1}, *δ*_{2}, …, *δ*_{k}} produced the mass *s*_{t}, so we need to generate *k* different *guesses* for each mass in the experimental spectrum. Each guess corresponds to the hypothesis that *s*_{t} = *x* – *δ*_{j}, where *x* is the mass of some partial peptide, *t* = 1, 2, …, *q* and *j* = 1, 2, …, *k*. Therefore, there are *k* different guesses for the mass *x* of the partial peptide, namely *s*_{t} + *δ*_{1}, *s*_{t} + *δ*_{2}, …, *s*_{t} + *δ*_{k}, corresponding to the mass *s*_{t} in the experimental spectrum. That is to say, each measured mass *s*_{t} yields *k* different possible conformations of a partial peptide in this model.

After that, each mass in the experimental spectrum is transformed into a set of *k* vertices in the spectrum graph, one for each possible ion type. The problem can then be solved using graph theory. In particular, we use a directed acyclic graph (DAG) to represent the experimental spectrum. The vertices and edges of the graph are defined as follows.

*Vertex:* Each possible conformation of a partial peptide is represented by a vertex. The vertex for *δ*_{j} of the mass *s*_{t} is labeled with mass *s*_{t} + *δ*_{j}.

*Edge:* A directed edge is drawn from vertex *u* to vertex *v* if the mass of *v* exceeds that of *u* by the mass of a single amino acid.

Now, if we add a vertex at 0 representing the starting vertex (with mass 0) and a vertex at *m* representing the parent peptide (with mass *m*), the peptide sequencing problem can be translated into the problem of finding a path from 0 to *m* in the resulting DAG. Specifically, if there exists an edge from *u* to *v*, the chain of amino acids is extended by the amino acid whose mass equals the mass difference between vertices *u* and *v*. Therefore, by following a path from 0 to *m* in the DAG, the amino acid chain grows step by step and the peptide sequence is eventually recovered.
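The construction above can be sketched in a few lines. This is an illustration under simplifying assumptions: masses are nominal integers, only a five-residue subset of the amino acid table is used, the offsets in `deltas` are hypothetical, and the names `build_spectrum_graph` and `find_paths` are not taken from the cited works.

```python
# Nominal integer residue masses for a small subset of amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}

def build_spectrum_graph(spectrum, deltas, parent_mass):
    """Return (vertices, edges); edges[u] lists (v, amino_acid) pairs."""
    vertices = {0, parent_mass}
    for s in spectrum:
        for d in deltas:
            vertices.add(s + d)
    edges = {u: [] for u in vertices}
    for u in vertices:
        for aa, mass in RESIDUE_MASS.items():
            if u + mass in vertices:
                edges[u].append((u + mass, aa))
    return vertices, edges

def find_paths(edges, source, target):
    """Enumerate the amino acid strings spelled by all source-to-target paths."""
    if source == target:
        return [""]
    paths = []
    for v, aa in edges[source]:
        paths.extend(aa + rest for rest in find_paths(edges, v, target))
    return paths

# Toy spectrum for the peptide GAS (prefix masses 57, 128, 215):
# 57 is observed with offset 0, and 128 is observed as 110 = 128 - 18.
vertices, edges = build_spectrum_graph([57, 110], [0, 18], 215)
print(find_paths(edges, 0, 215))  # ['GAS']
```

The recursion terminates because every edge strictly increases the mass, so the graph is acyclic. Real spectra would require a mass tolerance rather than exact integer equality.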

In addition, the vertices of the resulting spectrum graph are a set of numbers *s*_{t} + *δ*_{j} representing potential masses of *N*–terminal peptides adjusted by the ion type *δ*_{j}. Every mass *s*_{t} generates *k* different vertices, denoted by *V*_{t}(*s*), then

*V*_{t}(*s*) = {*s*_{t} + *δ*_{1}, *s*_{t} + *δ*_{2}, …, *s*_{t} + *δ*_{k}}.

The sets *V*_{t}(*s*) and *V*_{τ}(*s*) may overlap when *s*_{t} and *s*_{τ} are close, where *s*_{t}, *s*_{τ} ∈ *S*. The set of vertices in a spectrum graph is therefore {*s*_{initial}} ⋃ *V*_{1} ⋃ ⋯ ⋃ *V*_{q} ⋃ {*s*_{final}}, where *s*_{initial} = 0 and *s*_{final} = *m*.

The spectrum graph has at most *qk* + 2 vertices. We label each edge of the spectrum graph with the amino acid whose mass is equal to the mass difference between the two possible conformations (vertices) it connects. If we view vertices as putative *N*–terminal peptides, the edge from *u* to *v* implies that the *N*–terminal sequence corresponding to *v* can be obtained by extending the sequence at *u* by the amino acid that labels the edge from *u* to *v*, where *u*, *v* ∈ *V*(*G*).
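The bound on the number of vertices can be checked on a toy example in which two vertex sets overlap. The function name `vertex_set`, the integer masses, and the offsets are hypothetical choices made only for illustration.

```python
def vertex_set(spectrum, deltas, parent_mass):
    """Union of {s_initial}, V_1, ..., V_q and {s_final}."""
    vertices = {0, parent_mass}                  # s_initial = 0, s_final = m
    for s in spectrum:
        vertices |= {s + d for d in deltas}      # V_t for the mass s_t
    return vertices

spectrum = [57, 75]   # q = 2 measured masses
deltas = [0, 18]      # k = 2 hypothetical ion-type offsets
vertices = vertex_set(spectrum, deltas, 215)

# V_1 = {57, 75} and V_2 = {75, 93} share the vertex 75.
bound = len(spectrum) * len(deltas) + 2
print(len(vertices), bound)  # 5 6
```

Because overlapping guesses collapse into a single vertex, the count can fall below *qk* + 2 but never exceed it.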

If, for every *i* ∈ [1, *n*], *S* contains at least one ion type corresponding to the *N*–terminal partial peptide *P*_{i}, we say that the spectrum *S* of a peptide sequence *P* = *p*_{1} … *p*_{n} is complete. The use of a spectrum graph is based on the fact that, for a complete spectrum, there exists a path of length *n* + 1 from *s*_{initial} to *s*_{final} in the spectrum graph that is labeled by *P*. This observation casts the peptide sequencing problem as one of finding the correct path in the set of all paths between two given vertices in a DAG. In addition, if the spectrum is complete, the correct path is usually the longest path in the graph [5].
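Because every edge goes from a smaller mass to a larger one, sorting the vertices by mass gives a topological order, and a longest path can be found by dynamic programming. The sketch below assumes a hand-built toy graph with integer masses; the function name `longest_path` and the edge list are illustrative, and path length is counted in edges.

```python
def longest_path(edges, source, target):
    """Return (number_of_edges, label) of a longest source-to-target path, or None."""
    best = {source: (0, "")}
    for u in sorted(edges):               # increasing mass = topological order
        if u not in best:
            continue                      # u is not reachable from source
        length, label = best[u]
        for v, aa in edges[u]:
            if v not in best or best[v][0] < length + 1:
                best[v] = (length + 1, label + aa)
    return best.get(target)

# Toy graph: 0 -> 128 is a single-residue shortcut (K, mass 128) that
# competes with the two-step route 0 -> 57 -> 128 (G then A).
edges = {
    0: [(57, "G"), (128, "K")],
    57: [(128, "A")],
    128: [(215, "S")],
    215: [],
}
print(longest_path(edges, 0, 215))  # (3, 'GAS')
```

The shorter path spelling KS is also consistent with the graph, which illustrates why the longest path is preferred when the spectrum is complete.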

#### Discussion and further improvement

In this section, we described the *de novo* peptide sequencing problem and gave an effective solution based on a graph-theoretic method. The *de novo* method aims at inferring peptide sequences without using a database, and the spectrum graph model formulates this problem mathematically. The problem is solved by finding a longest path in a given spectrum graph. This kind of approach interprets the spectrum automatically using the table of amino acid masses and, unlike the database method, does not rely on the completeness of a database or the effectiveness of a search algorithm. Therefore, it usually costs less computation time, especially when the spectrum is of good quality.

However, this approach still has limitations. First, recovering the correct sequence as the longest path relies on the completeness of the mass spectrum, but experimental spectra are usually incomplete and contaminated by different kinds of noise, which makes the approach hard to apply directly. Second, finding the longest path is NP-hard in a general graph, so an efficient solution depends on exploiting the acyclic structure of the spectrum graph. Third, when a peptide fragments in MS/MS, it loses different kinds of small molecules, and accounting for all these losses requires many vertices in the spectrum graph; as the number of vertices grows, the computation time grows even faster. Finally, this kind of approach uses only the *m/z* values and pays little attention to peak intensity.

The performance of *de novo* peptide sequencing depends on the quality of the MS/MS spectra and on the algorithms. When the spectra are complete or of high quality, a *de novo* algorithm can find the correct sequences faster than the database search method, and it can also find new peptides that are not in the current database. With advanced algorithms, the *de novo* method can also handle spectra that contain considerable noise, missing peaks and so on. However, owing to the limitations of tandem mass spectrometry, the database method is still the most popular and widely used one today. Some possible improvements of the *de novo* method are given below. First, when the spectrum is incomplete, we can add the missing ions through their complementary ions: for any ion with mass *X* in MS/MS, there should be an ion with mass *Y* such that *X* + *Y* = *M*, where *M* is the mass of the parent peptide. Thus we can add complementary ions back into an experimental spectral data set [71]. Second, we can consider efficient algorithms for finding the longest path in a given graph, such as dynamic programming and parallel approaches. Third, the original model can be modified to seek possible local solutions rather than a global one. Some suboptimal algorithms can be considered, too [69]. Last but not least, a meaningful issue for future research is the combination of the *de novo* method with other approaches, for example, database search [72].
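The first suggestion, adding complementary ions, can be sketched as follows. It takes the relation *X* + *Y* = *M* literally and uses integer masses; the function name `add_complementary` and the toy values are assumptions, and a practical version would need a mass tolerance and ion-type corrections.

```python
def add_complementary(spectrum, parent_mass):
    """Return the sorted spectrum augmented with Y = M - X for every mass X."""
    masses = set(spectrum)
    for x in spectrum:
        y = parent_mass - x
        if 0 < y < parent_mass:           # keep only masses strictly inside (0, M)
            masses.add(y)
    return sorted(masses)

# Toy example with parent mass M = 215.
print(add_complementary([57, 128], 215))  # [57, 87, 128, 158]
```

Masses that already have their complement in the spectrum are left unchanged, so applying the function twice gives the same result as applying it once.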