
Identifying protein complexes with fuzzy machine learning model

Abstract

Background

Many computational approaches have been developed to detect protein complexes from protein-protein interaction (PPI) networks. However, these PPI networks are typically built from high-throughput experiments, and the presence of unreliable interactions in the PPI network makes this task very challenging.

Methods

In this study, we propose a Genetic-Algorithm Fuzzy Naïve Bayes (GAFNB) filter to classify protein complexes from candidate subgraphs. It takes unreliability into consideration and tackles the presence of unreliable interactions in protein complexes. We first obtain candidate protein complexes through existing popular methods. Each candidate protein complex is represented by 29 graph features and 266 biological property based features. The GAFNB model is then applied to classify the candidate complexes as positive or negative.

Results

Our evaluation indicates that protein complex identification algorithms combined with GAFNB filtering outperform the original ones. To evaluate the GAFNB model itself, we also compared its performance with Naïve Bayes (NB). The results show that GAFNB performs better than NB, indicating that a fuzzy model is more suitable when unreliability is present.

Conclusions

We conclude that filtering candidate protein complexes with the GAFNB model can improve the effectiveness of protein complex identification. It is necessary to take unreliability into account in this task.

Background

A protein complex is a group of two or more associated polypeptide chains. Proteins in a protein complex are linked by non-covalent protein-protein interactions (PPIs) and together participate in a certain biological process [1]. Protein complexes are a cornerstone of many biological processes and together they perform a vast array of biological functions [1]. Identifying protein complexes is therefore crucial for understanding the principles of cellular organization and for predicting protein functions.

A number of computational methods can be used to detect protein complexes from a PPI network [2], a graphical map of an entire organism's interactome constructed from a PPI knowledge base by considering individual proteins as nodes and the existence of a physical interaction between a pair of proteins as a link. For example, CMC (clustering based on maximal cliques) [3] discovers complexes from the weighted PPI network based on maximal cliques. COACH [4] is a core-attachment [5] based method to detect protein complexes from PPI networks, in which protein-complex cores are mined from the neighbourhood graphs and protein complexes are then formed by including attachments into the cores. Many graph-clustering methods can thus produce a number of candidate protein complexes. However, the precision of these existing methods is only around 0.4; their results contain many false positive protein complexes. Hence, classifying the true protein complexes among these results is an effective way to improve the performance of protein complex detection methods. Chen et al. [6] classify protein complexes from candidate subgraphs with enriched features, where each protein complex is represented by a feature vector derived from the corresponding complex graph and the biological properties of the constituent proteins.

However, the current PPI knowledge base is generally built from high-throughput techniques, such as mass spectrometry and yeast two-hybrid assays, so the gathered PPI information can be unreliable and incomplete [7]. Common classifiers may therefore be limited by the presence of noise in the PPI network. To address the noise issue, our previous work [8] proposed a genetic algorithm fuzzy Naïve Bayes (GAFNB) model to perform the classification. To improve the performance of protein complex identification, here we integrate the GAFNB model as a filter in the protein complex detection process. We first obtain candidate protein complexes from existing protein complex detection methods. Each candidate subgraph is represented by a feature vector that includes 29 graph features and 266 biological property based features [6]. The genetic algorithm fuzzy Naïve Bayes (GAFNB) model is then trained to classify candidate protein complexes as positive or negative, using positive protein complexes determined through experiments and negatives generated randomly. After the results of the protein complex detection methods are filtered through the GAFNB model, the precision of the existing methods is improved.

The rest of the paper is organized as follows. In the Methods section, we present a general framework for identifying protein complexes with our GAFNB filter. The experiments are described next, and experimental results and discussion are presented last.

Methods

For a given organism, the proposed protein complex identification approach contains two steps (Figure 1). The first step is to detect candidate protein complexes using state-of-the-art protein complex detection algorithms. The second step is to filter the candidate protein complexes with the GAFNB model. In the following, we first introduce the two state-of-the-art protein complex detection algorithms used to generate candidates, and then present the GAFNB model in detail.

Figure 1

Flowchart of our method. The first step is to detect candidate protein complexes using state-of-the-art protein complex detection algorithms. The second step is to filter the candidate protein complexes with the GAFNB model.

Candidate protein complexes identification algorithms

We implement two of the state-of-the-art protein complex identification algorithms here: COACH and CMC. The results of these existing computational methods provide candidate protein complexes for filtering.

COACH [4] is a core-attachment [5] based method to detect protein complexes from PPI networks, in which protein-complex cores are mined from the neighbourhood graphs and protein complexes are then formed by including attachments into the cores. Proteins within the same protein-complex core detected by this method have high functional similarity and tend to be co-localized.

CMC [3] discovers complexes from the weighted PPI network based on maximal cliques. It first uses an iterative scoring method (AdjustCD) to assign a weight to each protein pair, where the weight indicates the reliability of the interaction between the two proteins. It then generates all maximal cliques from the weighted PPI network. Finally, it removes or merges highly overlapping clusters based on their interconnectivity to obtain the protein complexes.

GAFNB model

After obtaining the candidate protein complexes, we need a filter to classify the candidates as positive or negative. However, the feature values of the candidates are unreliable because PPIs are generally obtained from high-throughput experiments, so a traditional classifier may not be suitable for this task. Our previous work has shown that GAFNB can handle unreliable information in features [8, 11]. Hence we apply the GAFNB model to filter the candidates in order to improve the performance of protein complex detection.

A. Fuzzy Certain Feature Membership (FCFM)

Because PPI data contain false positives and false negatives, the feature values of protein complexes computed from the PPI data can be uncertain. For example, the density of a candidate protein complex is calculated from the edges (PPIs) in its subgraph, so the density value is not certain. We call such features Fuzzy Features; in contrast, reliable features are Certain Features. The uncertainty about the values of a feature can be represented as a matrix, in which each element is the membership of a Certain feature value in a Fuzzy feature value (Table 1). As shown in Table 1, $X_{11}$ is the membership of the Certain feature value of density (< 0.5) in the Fuzzy feature value (< 0.5). Each row of such a matrix is normalized so that its entries sum to 1. We refer to these matrices as the Fuzzy Certain Feature Membership (FCFM).

Table 1 Fuzzy certain feature membership
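To make the FCFM representation concrete, the following is a minimal sketch of a membership matrix for the density feature; the bin boundaries and membership values here are illustrative assumptions, not the values of Table 1.

```python
import numpy as np

# Hypothetical FCFM matrix for the feature "density", discretized into two
# bins (< 0.5 and >= 0.5). Rows correspond to Fuzzy feature values, columns
# to Certain feature values; each entry is the membership of a Certain value
# in a Fuzzy value, and every row sums to 1.
fcfm_density = np.array([
    [0.8, 0.2],   # Fuzzy "< 0.5": mostly the Certain "< 0.5" bin
    [0.1, 0.9],   # Fuzzy ">= 0.5": mostly the Certain ">= 0.5" bin
])

assert np.allclose(fcfm_density.sum(axis=1), 1.0)
```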

B. GAFNB model

Let $P = \{p_\alpha\}$ be a set of candidate protein complexes with features $A = \{a_i\}$, and let $D = \{d_k\}$ be the set of classes for candidate protein complexes. Let $I$ denote an instance, i.e. a candidate protein complex to be classified. The model is trained in the following steps, using positive protein complexes determined through experiments and negatives generated randomly.

1) Compute probabilities of certain feature values

We define the conditional probability of $a_i = v_{ij}$ for class $d_k$ using the Laplace correction [12] as follows:

$$p(a_i = v_{ij} \mid d_k) = \frac{f(v_{ij}, d_k) + 1}{f(d_k) + |a_i|} \qquad (1)$$

where $\{v_{ij}\}$ is the set of values for feature $a_i$, $f(v_{ij}, d_k)$ is the frequency count of instances in class $d_k$ having $a_i = v_{ij}$, $f(d_k)$ is the number of instances in the dataset belonging to class $d_k$, and $|a_i|$ is the number of possible values of feature $a_i$.
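As an illustration of equation (1), the following sketch estimates the Laplace-corrected conditional probability from discretized feature values; the data layout and function name are our own assumptions.

```python
from collections import Counter

def conditional_prob(values_in_class, value, n_possible_values):
    """Laplace-corrected P(a_i = v_ij | d_k) as in equation (1).

    values_in_class:   list of the (discretized) values feature a_i takes
                       over all training instances of class d_k.
    value:             the feature value v_ij whose probability is wanted.
    n_possible_values: |a_i|, the number of possible values of feature a_i.
    """
    counts = Counter(values_in_class)
    return (counts[value] + 1) / (len(values_in_class) + n_possible_values)

# Example: feature "density" discretized into 2 bins over 10 positive complexes
positives_density = ["<0.5"] * 7 + [">=0.5"] * 3
print(conditional_prob(positives_density, "<0.5", 2))  # (7+1)/(10+2) = 0.667
```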

2) Compute optimal FCFM using Genetic Algorithm (GA)

An FCFM is a set of matrices that represent the features as mentioned above [11]. Each feature has one corresponding matrix, and an element of the matrix is the membership of a Certain feature value in a Fuzzy feature value. We first create a set of FCFMs, referred to as the population; every matrix of every FCFM in the initial population is set to the identity matrix.

The Genetic Algorithm is applied to the FCFM population to calculate the memberships of Certain feature values in Fuzzy feature values. Three basic steps of the GA are performed over several iterations (see Figure 2), and we consider each iteration a generation. The basic steps of selection, crossover and mutation are described below; a compact code sketch of one generation is given after Figure 6.

Figure 2

Genetic Algorithm. Three basic steps of the GA: selection, crossover and mutation.

a. Selection. Each FCFM gets a score from a fitness function, and the high-ranking ones are selected (see Figure 3). The details of the fitness function are given in our previous paper [11]. We perform n-fold cross validation to obtain a set of n accuracy measures from the Naïve Bayes model, where the probability of a Fuzzy feature value is calculated by combining the probabilities of the Certain feature values with the given FCFM. The mean and standard deviation of the classification accuracies are then used to compute the score of the FCFM as follows:

Figure 3

Computation of the fitness score. Steps of calculating the fitness score.

$$\mathrm{score} = \mathrm{mean} - \mathrm{standard\ deviation} \qquad (2)$$

b. Crossover. The selected FCFMs are referred to as parents. The corresponding matrices of two parents are randomly combined to generate new members (children), and the FCFMs that were not selected are replaced by these child FCFMs. An example of the crossover operation on an FCFM is shown in Figure 4. In this task, two attributes of the candidate protein complexes, density and mean degree, have Fuzzy feature values.

Figure 4

Crossover operation for the example FCFM.

c. Mutation. An element of an FCFM matrix is randomly selected and altered to a random value in the interval [0, 1]. The other elements in the same row are then recalculated so that the row still sums to 1. The number of mutations performed is determined by a parametric study. Again taking the features density and mean degree as examples, Figure 5 illustrates the mutation operation on the FCFM.

Figure 5

Mutation operation for the example FCFM. This is implemented as a two-step process. In the first step (a), a parameter is randomly selected and its value is changed to a random value in the interval [0, 1]. In the second step (b), the parameters in the row of the changed parameter are divided by their row sum, so that the row sum remains equal to 1.

In summary, new population members are generated after each iteration; Figure 6 shows an example of the population generated in the second iteration of the GA. Finally, the scores of the population members converge to a constant value after some iterations, and the FCFM with the highest score is selected for computing the probabilities of fuzzy events.

Figure 6

Population of FCFMs in the second iteration of the genetic algorithm. Member 1 escaped mutation in the first iteration. Since members 2 and 5 have high scores, they have been selected. Members 1 and 3 will be deleted and replaced by progeny generated by crossover and mutation.
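For illustration, here is a compact sketch of one GA generation over a population of FCFMs, assuming each FCFM is a dictionary mapping a fuzzy feature name to its membership matrix and that an external routine returns the n-fold cross-validation accuracies of the fuzzy Naïve Bayes model built with a given FCFM. The selection, crossover and mutation details are simplified relative to [11].

```python
import random
import numpy as np

def fitness(fcfm, cv_accuracies_fn):
    """Equation (2): score = mean - standard deviation of the n-fold
    cross-validation accuracies of the fuzzy Naive Bayes built with `fcfm`.
    `cv_accuracies_fn` is assumed to be supplied by the caller."""
    accs = np.asarray(cv_accuracies_fn(fcfm))
    return accs.mean() - accs.std()

def crossover(parent_a, parent_b):
    """Child takes each feature's whole membership matrix from one parent."""
    return {feat: (parent_a if random.random() < 0.5 else parent_b)[feat].copy()
            for feat in parent_a}

def mutate(fcfm):
    """Pick one matrix element at random, set it to a random value in [0, 1],
    then renormalize its row so the memberships still sum to 1."""
    feat = random.choice(list(fcfm))
    m = fcfm[feat]
    i, j = random.randrange(m.shape[0]), random.randrange(m.shape[1])
    m[i, j] = random.random()
    m[i, :] /= m[i, :].sum()

def next_generation(population, cv_accuracies_fn, n_keep=2):
    """Keep the top-scoring FCFMs and refill the population with mutated
    children produced by crossover of the survivors."""
    ranked = sorted(population, key=lambda f: fitness(f, cv_accuracies_fn),
                    reverse=True)
    survivors = ranked[:n_keep]
    while len(survivors) < len(population):
        child = crossover(*random.sample(survivors, 2))
        mutate(child)
        survivors.append(child)
    return survivors
```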

3) Compute probabilities of fuzzy events

The membership of a Certain feature value $v_{ix}$ in a Fuzzy feature value $\bar{v}_{ij}$ is denoted $\mu_{\bar{v}_{ij}}(v_{ix})$ [9, 10]. The conditional probabilities of fuzzy feature values ($\bar{v}_{ij}$) for a particular class ($d_k$) are calculated as,

$$P(\bar{a}_i = \bar{v}_{ij} \mid d_k) = \sum_x P(a_i = v_{ix} \mid d_k)\,\mu_{\bar{v}_{ij}}(v_{ix}) \qquad (3)$$

The marginal probability of a fuzzy feature value $\bar{v}_{ij}$ is,

$$P(\bar{v}_{ij}) = \sum_x P(v_{ix})\,\mu_{\bar{v}_{ij}}(v_{ix}) \qquad (4)$$

4) Inference

When the feature values of an instance (candidate protein complex) are fuzzy, denoted $\bar{I}$, the posterior probability of class $d_k$ for this instance is calculated using,

$$P(d_k \mid \bar{I}) = \frac{P(d_k)\,\prod_{\bar{v}_{ij} \in \bar{I}} P(\bar{v}_{ij} \mid d_k)}{P(\bar{I})} \qquad (5)$$

Since the denominator $P(\bar{I})$ is common to all classes for a given instance, it is dropped; the resulting label score is directly proportional to the posterior probability.

$$P(d_k \mid \bar{I}) \propto P(d_k)\,\prod_{\bar{v}_{ij} \in \bar{I}} P(\bar{v}_{ij} \mid d_k) \qquad (6)$$
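The following sketch ties equations (3) and (6) together, assuming the certain-value conditional probabilities have already been estimated with equation (1) and that each FCFM row is indexed by a fuzzy value with columns indexed by the certain values; names and data layout are illustrative assumptions.

```python
import numpy as np

def fuzzy_conditional(certain_probs, fcfm_row):
    """Equation (3): P(fuzzy value | class) = sum over certain values x of
    P(certain value x | class) * membership of the fuzzy value in x."""
    return float(np.dot(fcfm_row, certain_probs))

def label_score(prior, fuzzy_memberships, certain_cond_probs):
    """Equation (6): unnormalized posterior, P(d_k) times the product of the
    fuzzy conditional probabilities, one per fuzzy feature of the instance.

    fuzzy_memberships:  list of FCFM rows, one per fuzzy feature value
                        observed in the instance.
    certain_cond_probs: list of arrays P(a_i = v_ix | d_k), aligned with the
                        columns of the corresponding FCFM row.
    """
    score = prior
    for row, probs in zip(fuzzy_memberships, certain_cond_probs):
        score *= fuzzy_conditional(probs, row)
    return score

# Toy example with one fuzzy feature (density, 2 bins):
prior_pos = 0.5
p_density_given_pos = np.array([0.667, 0.333])   # from equation (1)
fcfm_row_lt05 = np.array([0.8, 0.2])             # memberships of fuzzy "<0.5"
print(label_score(prior_pos, [fcfm_row_lt05], [p_density_given_pos]))  # ~0.30
```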

Experiments

We plug the GAFNB model in as a filter in the protein complex detection process to improve its performance. To evaluate the model's ability to deal with fuzzy feature values, we first compared the GAFNB model with Naïve Bayes on two datasets; one ten-fold cross validation run was performed for both models on each dataset. The framework of our study is illustrated in Figure 7. Next, we combined the GAFNB model with state-of-the-art protein complex detection methods (CMC and COACH) to illustrate the utility of the GAFNB filter.

Figure 7

Flowchart of GAFNB evaluation.

Experimental data

We downloaded yeast protein interaction data from DIP [13] with a total of 17,201 PPI pairs. The protein complex data was downloaded from a public repository located at http://www.cs.cmu.edu/~qyj/SuperComplex. It consists of 493 protein complexes from MIPS [14] and TAP-MS [15] (size >2).

Evaluation metrics

Since the benchmark set of protein complexes is incomplete and negative protein complexes vastly outnumber positives in the real world, we prefer to retrieve as many true protein complexes as early as possible in the ranking. The Lift [16] measure is a good choice for evaluating this task. When the prediction scores of a classifier are ranked, the proportion of known positives in the top n is called the estimated precision (EP). The baseline precision (BP) is the proportion of known positives over the total number of samples in the dataset. Lift is defined as follows,

$$\mathrm{Lift} = EP / BP \qquad (7)$$

which indicates how quickly the classifier retrieves positives.
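A minimal sketch of equation (7) for a ranked list of predictions (function and variable names are our own):

```python
def lift_at_n(scores, labels, n):
    """Lift = EP / BP (equation 7).

    scores: classifier scores, higher means more likely positive.
    labels: 1 for a known positive complex, 0 otherwise.
    n:      number of top-ranked predictions to inspect.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    ep = sum(label for _, label in ranked[:n]) / n   # estimated precision
    bp = sum(labels) / len(labels)                    # baseline precision
    return ep / bp

# Example: 2 of the top 4 predictions are positives, 3 of 10 overall
print(lift_at_n([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1],
                [1, 0, 1, 0, 0, 1, 0, 0, 0, 0], n=4))  # (2/4)/(3/10) = 1.67
```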

Besides the Lift value, we also used the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) to evaluate our model [17]. The ROC curve is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied.

We followed existing approaches [4, 18, 19] to evaluate the experimental performance of protein complex identification. Equation (8) calculates the neighborhood affinity score $NA(p, b)$ between a predicted cluster $p \in P$ and a real complex $b \in B$, where $P$ is the set of complexes predicted by a computational method and $B$ is the set of real complexes in the benchmark.

$$NA(p, b) = \frac{|V_p \cap V_b|^2}{|V_p| \times |V_b|} \qquad (8)$$

In equation (8), $|V_p|$ is the number of proteins in the predicted complex and $|V_b|$ is the number of proteins in the real complex. If $NA(p, b) \geq \omega$, a real complex and a predicted complex are considered to match ($\omega$ is usually set to 0.20 or 0.25) [2].

After all real complexes and predicted clusters have their best match calculated according to their NA scores, precision, recall, and F-measure are applied to assess the methods:

$$N_{cp} = \left|\{\, p \mid p \in P,\ \exists b \in B,\ NA(p, b) \geq \omega \,\}\right| \qquad (9)$$

$$N_{cb} = \left|\{\, b \mid b \in B,\ \exists p \in P,\ NA(p, b) \geq \omega \,\}\right| \qquad (10)$$

$$\mathrm{Precision} = \frac{N_{cp}}{|P|}, \qquad \mathrm{Recall} = \frac{N_{cb}}{|B|} \qquad (11)$$

$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (12)$$

$N_{cp}$ is the number of predicted complexes that match at least one real complex, and $N_{cb}$ is the number of real complexes that match at least one predicted complex [2].
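A minimal sketch of equations (8)-(12), representing each complex simply as a set of protein identifiers (an assumption about the data layout):

```python
def na_score(predicted, real):
    """Neighborhood affinity NA(p, b) = |Vp ∩ Vb|^2 / (|Vp| * |Vb|), eq. (8)."""
    overlap = len(predicted & real)
    return overlap ** 2 / (len(predicted) * len(real))

def complex_metrics(predicted_set, benchmark_set, omega=0.25):
    """Ncp, Ncb, precision, recall and F-measure, equations (9)-(12)."""
    ncp = sum(1 for p in predicted_set
              if any(na_score(p, b) >= omega for b in benchmark_set))
    ncb = sum(1 for b in benchmark_set
              if any(na_score(p, b) >= omega for p in predicted_set))
    precision = ncp / len(predicted_set)
    recall = ncb / len(benchmark_set)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy example with two predictions and two benchmark complexes
predicted = [{"A", "B", "C"}, {"D", "E"}]
benchmark = [{"A", "B", "C", "D"}, {"F", "G"}]
print(complex_metrics(predicted, benchmark))  # (0.5, 0.5, 0.5)
```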

GAFNB performance evaluation

To train our model, we need negative data. However, it is rare to find a confirmed report of a non-complex, so we randomly selected proteins from the PPI network to generate negative protein complexes. We evaluate our model on two datasets: a balanced dataset containing 493 positive and 493 negative protein complexes, and an unbalanced dataset containing 493 positive and 9,878 negative protein complexes. In the unbalanced dataset the ratio of negatives to positives is about 20:1, which represents the real-life scenario in which positive protein complexes are very rare.
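As a rough illustration of how such negatives could be sampled, the sketch below draws random protein sets from the network; the size range and exact sampling scheme are our own assumptions, not necessarily the procedure used in this study.

```python
import random

def random_negative_complexes(proteins, n_negatives, min_size=3, max_size=10,
                              positives=(), seed=0):
    """Sample random protein sets from the PPI network to serve as negative
    complexes, skipping exact duplicates of known positive complexes."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    negatives = set()
    proteins = list(proteins)
    while len(negatives) < n_negatives:
        size = rng.randint(min_size, max_size)          # assumed size range
        candidate = frozenset(rng.sample(proteins, size))
        if candidate not in known:
            negatives.add(candidate)
    return [set(c) for c in negatives]
```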

Following a previous study [6], each protein complex is represented by a 295-dimensional feature vector comprising 29 graph features and 266 biological property based features. The graph features are extracted from the subgraph formed by the constituent proteins of the sample. The biological properties include biochemical properties, protein length and physicochemical properties: the biochemical properties cover amino acid composition and secondary structure, while the physicochemical properties cover hydrophobicity, normalized van der Waals volume, polarity, polarizability and solvent accessibility (for details, please see ref [6]). For a complex consisting of n proteins, the mean and maximum biological feature values over the n proteins are taken as the corresponding complex feature values.
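A small sketch of this aggregation step, with placeholder feature values (the actual 266 biological properties are described in [6]):

```python
import numpy as np

def complex_biological_features(protein_feature_vectors):
    """Aggregate per-protein biological property vectors into complex-level
    features: the mean and the maximum over the complex's n proteins."""
    matrix = np.vstack(protein_feature_vectors)   # shape: (n proteins, n features)
    return np.concatenate([matrix.mean(axis=0), matrix.max(axis=0)])

# Example: a 3-protein complex with 2 placeholder biological features each
features = [np.array([0.2, 7.0]), np.array([0.5, 4.0]), np.array([0.8, 9.0])]
print(complex_biological_features(features))  # [0.5, 6.667, 0.8, 9.0]
```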

Feature selection is the process of selecting a subset of relevant features for use in model construction. It can improve model interpretability, shorten training time and enhance generalisation by reducing overfitting. There are two common categories of feature selection algorithms: filters and wrappers. Filter methods, such as information gain and the chi-square test, produce a feature set that is not tailored to a specific type of predictive model and evaluate each feature individually. Wrapper methods, in contrast, usually provide the best-performing feature set for a particular type of model. Since the model is fixed in our study, we chose a wrapper method to select features. However, wrapper selection is computationally intensive: if the number of features is n, the number of possible feature sets is 2^n. Hence many popular search approaches use greedy hill climbing or best-first search, which iteratively evaluate a candidate subset of features, then modify the subset and evaluate whether the new subset improves on the old one [20]. In our study, we used WEKA's Wrapper selection [21] to find a proper feature subset for the Naïve Bayes model. The search started from the empty feature set and used a forward best-first search, terminating after five consecutive fully expanded non-improving subsets.
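The wrapper search can be sketched roughly as follows, using scikit-learn's Gaussian Naïve Bayes as a stand-in for the WEKA implementation actually used and a simplified greedy variant of the best-first search; this is an approximation of the procedure, not the authors' code. X is assumed to be a 2-D numpy feature matrix and y the class labels.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def wrapper_forward_selection(X, y, max_non_improving=5, cv=5):
    """Greedy forward wrapper selection for a Naive Bayes model: start from
    the empty feature set, repeatedly try adding the single feature that most
    improves cross-validated accuracy, and stop after `max_non_improving`
    consecutive expansions without improvement."""
    selected, best_score, stall = [], 0.0, 0
    remaining = list(range(X.shape[1]))
    while remaining and stall < max_non_improving:
        scores = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(GaussianNB(), X[:, cols], y, cv=cv).mean()
            scores.append((acc, f))
        acc, f = max(scores)              # best candidate expansion
        if acc > best_score:
            best_score, stall = acc, 0
            selected.append(f)
        else:
            stall += 1                    # count a non-improving expansion
        remaining.remove(f)
    return selected, best_score
```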

Protein complexes identification

We chose two different state-of-the-art methods to obtain candidate protein complexes for GAFNB filtering: CMC and COACH, both run on the DIP network. For COACH, the parameter was set to 0.225, as in the original paper [4]. CMC was run on a network reweighted by AdjustCD; the top 10,000 PPI pairs were selected and both of its parameters were set to 0.25. We used the balanced dataset, containing 493 positive and 493 negative protein complexes, for this experiment. One five-fold cross validation run was carried out for the GAFNB model: each time, four folds were used for training and one fold for testing the final protein complex identification performance. In the evaluation, we first filtered out predicted complexes whose NA score against the training-fold complexes was above 0.5, and then calculated the performance on the held-out fold. The average performance over the folds is used for comparison.
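A rough sketch of one fold of this evaluation, reusing na_score and complex_metrics from the evaluation-metrics sketch above; gafnb_score here is a placeholder for the trained GAFNB classifier's positive-class decision.

```python
def evaluate_filtered_fold(candidates, train_complexes, test_complexes,
                           gafnb_score, threshold=0.5):
    """One fold of the filtering evaluation: drop candidates that already
    match a training-fold complex (NA > 0.5), keep only candidates the GAFNB
    model labels positive, then score against the held-out fold.
    `na_score` and `complex_metrics` are the helpers sketched earlier;
    `gafnb_score` is a placeholder for the trained classifier."""
    unseen = [c for c in candidates
              if all(na_score(c, b) <= threshold for b in train_complexes)]
    kept = [c for c in unseen if gafnb_score(c) > 0.5]
    return complex_metrics(kept, test_complexes)
```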

Results and discussion

GAFNB performance evaluation

A. Feature selection

The dataset was randomly split into 10 sets. Each set was selected in turn as the test set, and the remaining sets were combined to form the training set for the WEKA Wrapper algorithm. Hence, we obtained ten optimal feature subsets based on Naïve Bayes classifiers for each dataset (the balanced and the unbalanced dataset). Figure 8 shows the overlap of the ten optimal feature subsets on the balanced dataset; only the feature "weight edge variance with missing edge" was selected all 10 times. Similar results were obtained on the unbalanced dataset, as shown in Figure 9: three features ("weight edge mean with missing edge", "topological change 0.3 0.4" and "degree max") were selected all 10 times. All the features selected ten times are graph topological features, while the selected biological property features differ between runs, probably because some of them are correlated and sensitive to the training data.

Figure 8

Overlap of the selected feature subsets in the balanced dataset.

Figure 9

Overlap of the selected feature subsets in the unbalanced dataset.

B. Evaluation results

Ten-fold cross validation was run to evaluate the performance of the GAFNB model. Table 2 shows the Lift values of NB and of GAFNB with 10, 30 and 50 generations on the balanced dataset. The Lift value of GAFNB with 50 generations is always higher than that of NB on the balanced dataset; GAFNB-50 achieved a Lift value of 2.01 in the top 20, a significant improvement over NB. The more generations are trained, the larger the Lift value, probably because models given more generations are better optimized. Similar results are obtained on the unbalanced dataset (Table 3): when the number of generations is above 10, GAFNB always performs better than NB. We also evaluated our model using the AUC measure (Table 4). The AUC of GAFNB is better than that of NB on both the balanced and the unbalanced data. The AUC of GAFNB improves with the number of generations on the balanced data. The unbalanced data behave differently: the AUC of GAFNB-30 is greater than that of GAFNB-50, possibly because the model was overfitted or a particular cross-validation set converged to a local minimum. The ROC curves also reflect the better performance of GAFNB, as shown in Figure 10. On the unbalanced data, the ROC curve of GAFNB is always above that of NB, while on the balanced data the true positive rate (TPR) of GAFNB exceeds that of NB when the false positive rate (FPR) > 0.1. In summary, all of this indicates that GAFNB performs better than NB, and that a fuzzy model is more suitable when unreliability is present.

Table 2 Lift values of GAFNB and NB on the small balanced dataset
Table 3 Lift values of GAFNB and NB on the large unbalanced dataset
Table 4 AUC of NB and GAFNB on the balanced and unbalanced datasets
Figure 10

The ROC curves of NB and GAFNB on the balanced and unbalanced data.

Protein complex identification

The comparison shows that, when COACH's results are filtered with the GAFNB model, its precision increased from 0.3387 to 0.4852 and its F-measure from 0.4465 to 0.5574 over the original results (Table 5, Figure 11), with only a slight decrease in recall (from 0.6551 to 0.6548). The number of generations was set to 30 here, and the feature subset was selected by WEKA on the balanced data as described for Figure 8. After filtering the results of CMC with the GAFNB model, the precision and F-measure of CMC increased from 0.4055 to 0.496 and from 0.4984 to 0.5611, respectively, again with a slight decrease in recall (from 0.6466 to 0.6459) (Table 5, Figure 12). Generally, precision and F-measure increase when the results are filtered with the GAFNB model, indicating that the model can filter out false positives. Whereas existing protein complex identification methods consider only the graph structure of the PPI network, the GAFNB model incorporates many biological features to filter out false positives. However, since CMC predicted only 365 candidate protein complexes, filtering out complexes can hurt its recall; the more candidate protein complexes the underlying identification method produces, the better the GAFNB filter performs.

Table 5 Performance Comparison CMC, CMC+GAFNB, COACH and COACH+GAFNB
Figure 11

Performance comparison of COACH and COACH with the GAFNB filter.

Figure 12

Performance comparison of CMC and CMC with the GAFNB filter.

Evaluation of the contribution of each feature type toward protein complex identification will be done in the future. Some of our predicted complexes do not match any complex in the benchmark set. We found that these predicted complexes have high biological significance, as computed using the P-value, and high local density, as shown in Figure 13; they may be true complexes that are as yet undiscovered. The P-values were calculated with SGD's GO::TermFinder [22]. A low P-value for a predicted complex generally indicates that the collective occurrence of its proteins in a complex does not occur merely by chance, and thus the predicted complex has a high statistical probability of being real.

Figure 13

Some false positive complexes with low P-values.

Conclusions

In this paper, we plug a GAFNB model into the protein complex detection process. The candidate protein complexes produced by existing methods are filtered by our GAFNB model. Results show that the performance of protein complex identification methods improves when using our GAFNB filter, and that the GAFNB model is more suitable when unreliability is present. In the future, we will apply our genetic algorithm fuzzy Naïve Bayes model to other tasks with fuzzy features.

References

  1. Hartwell LH, et al.: From molecular to modular cell biology. Nature 1999, 402: 47. 10.1038/46972

  2. Li X, Wu M, Kwoh CK, Ng SK: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics 2010, 11(Suppl 1): S3. 10.1186/1471-2164-11-S1-S3

  3. Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics 2009, 25(15): 1891–1897. 10.1093/bioinformatics/btp311

  4. Wu M, Li X, Kwoh CK, Ng SK: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics 2009, 10(1): 169. 10.1186/1471-2105-10-169

  5. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084): 631–636. 10.1038/nature04532

  6. Chen L, et al.: Identifying protein complexes using hybrid properties. Journal of Proteome Research 2009, 8: 5212–5218. 10.1021/pr900554a

  7. Chen J, et al.: Discovering reliable protein interactions from high-throughput experimental data using network topology. Artificial Intelligence in Medicine 2005, 35: 37–47. 10.1016/j.artmed.2005.02.004

  8. Xu B, Lin H, Yang Z, Wagholikar KB, Liu H: Classifying protein complexes from candidate subgraphs using fuzzy machine learning model. In Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on. IEEE; 2012: 640–647.

  9. Wagholikar K, et al.: Can fuzzy modifiers improve medical decision support? In IDAMAP Workshop, AMIA 2010 Annual Conference. Washington D.C., USA; 2010.

  10. Wagholikar K: Genetic Algorithm and Fuzzy Modeling for Medical Decision Support. PhD thesis, University of Pune, Pune, India; 2010.

  11. Wagholikar K, et al.: GA-Fuzzy Naive Bayes method for medical decision support. In World Conference on Soft Computing. San Francisco; 2011: 209.

  12. Kohavi R, et al.: Improving simple Bayes. In The Ninth European Conference on Machine Learning; 1997: 78–87.

  13. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 2002, 30(1): 303–305. 10.1093/nar/30.1.303

  14. Mewes HW, et al.: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Research 2004, 32: D41. 10.1093/nar/gkh092

  15. Gavin AC, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. 10.1038/nature04532

  16. Coppock DS: Why Lift? Data modeling and mining. Information Management Online 2002, 5329-1.

  17. Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters 2006, 27: 861–874. 10.1016/j.patrec.2005.10.010

  18. Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. Journal of Bioinformatics and Computational Biology 2008, 6(03): 435–466. 10.1142/S0219720008003497

  19. Geva G, Sharan R: Identification of protein complexes from co-immunoprecipitation data. Bioinformatics 2011, 27(1): 111–117. 10.1093/bioinformatics/btq652

  20. John GH, et al.: Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning. New Brunswick, NJ, USA; 1994: 121–129.

  21. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009, 11(1): 10–18. 10.1145/1656274.1656278

  22. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20(18): 3710–3715. 10.1093/bioinformatics/bth456


Acknowledgements

This study was supported by an award from China Scholarship Council, Ministry of Education, grants from National Science Foundation ABI: 0845523, National Institute of Health R01LM009959A1, the Natural Science Foundation of China (No. 60673039, 61070098 and 61272373), the National High Tech Research and Development Plan of China (No. 2006AA01Z151), the Fundamental Research Funds for the Central Universities (No.DUT10JS09) and Liaoning Province Doctor Start up Fund (No. 20091015).

Declarations

The publication costs for this article were funded by Hongfei Lin.

This article has been published as part of Proteome Science Volume 11 Supplement 1, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Proteome Science. The full contents of the supplement are available online at http://www.proteomesci.com/supplements/11/S1.

Author information

Corresponding author

Correspondence to Bo Xu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BX conceived of the study, participated in its design, carried out all experiments, and drafted the manuscript. HLin and ZY reviewed the manuscript. KBW participated in the GAFNB experiments. HLiu helped draft the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Cite this article: Xu, B., Lin, H., Wagholikar, K.B. et al. Identifying protein complexes with fuzzy machine learning model. Proteome Sci 11 (Suppl 1), S21 (2013). https://doi.org/10.1186/1477-5956-11-S1-S21