Identification of protein functions using a machine-learning approach based on sequence-derived properties

Table 5 Accuracy of predictions using training and blind test datasets with the SVM and random forest methods

Category	Protein class	Training set positive	Training set negative	Test set positive	Test set negative	SVM_FF train (%)	SVM_FF test (%)	SVM_CFS train (%)	SVM_CFS test (%)	RF_FF train (%)	RF_FF test (%)	RF_CFS train (%)	RF_CFS test (%)
Biological process	Transport	2,824	3,583	298	414	73.26	71.34	94.38	93.53	93.14	92.41	94.66	94.24
	Transcription	3,644	3,872	415	421	87.78	85.04	96.62	96.65	94.25	94.61	94.65	94.73
	Translation	139	1,886	16	210	98.81	98.67	98.37	97.78	97.87	96.90	98.07	97.34
	Gluconate utilisation	53	420	7	46	98.73	98.11	98.94	98.11	97.04	98.11	98.30	100
	Amino acid biosynthesis	2,769	3,970	289	460	73.55	76.63	90.28	92.12	95.69	96.12	96.29	96.12
	Fatty acid metabolism	601	3,445	81	369	90.58	87.55	94.19	92	95.99	94.88	96.93	95.77
Molecular function	Acetylcholine receptor inhibitor	93	1,840	10	205	100	99.53	100	100	100	100	100	100
	G-protein coupled receptor	2,571	3,828	263	448	76.04	77.07	98.76	98.17	96.62	97.74	97.60	97.46
	Guanine nucleotide-releasing factor	335	3,994	35	446	98.96	98.75	99.51	98.96	98.49	98.96	98.98	98.54
Cellular component	Fibre protein	42	1,266	6	140	99.84	99.31	99.92	99.31	99.38	99.31	99.84	98.63
Domain	Transmembrane	1,904	3,930	223	426	80.01	79.81	97.15	97.84	96.02	97.84	96.46	97.38

Training-set accuracy was estimated by 10-fold cross-validation while building each classification model; test-set accuracy was obtained by applying the built model to the blind test dataset. Accuracies (%) for both the training and test datasets are reported to show that the models strike a good balance between overfitting and underfitting. Positive: number of positive samples; Negative: number of negative samples; SVM: support vector machine; RF: random forest; FF: full features; CFS: correlation-based feature subset selection. Bold values indicate the highest accuracy among the four methods.
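One way to read the paired train and test columns is to compute the difference between them for each method: a small gap is consistent with a model that generalises, whereas a large positive gap would hint at overfitting. The Python sketch below is an illustration only and is not part of the original analysis. It copies three rows from Table 5; the names `TABLE5`, `accuracy_gap`, `gaps_for` and `largest_gap` are hypothetical, and no cut-off for an acceptable gap is assumed.

```python
# Train/test accuracies (%) copied from three rows of Table 5.
# Each tuple is (train accuracy, test accuracy) for one method.
TABLE5 = {
    "Transport": {
        "SVM_FF": (73.26, 71.34),
        "SVM_CFS": (94.38, 93.53),
        "RF_FF": (93.14, 92.41),
        "RF_CFS": (94.66, 94.24),
    },
    "Fatty acid metabolism": {
        "SVM_FF": (90.58, 87.55),
        "SVM_CFS": (94.19, 92.00),
        "RF_FF": (95.99, 94.88),
        "RF_CFS": (96.93, 95.77),
    },
    "Transmembrane": {
        "SVM_FF": (80.01, 79.81),
        "SVM_CFS": (97.15, 97.84),
        "RF_FF": (96.02, 97.84),
        "RF_CFS": (96.46, 97.38),
    },
}


def accuracy_gap(train, test):
    """Return train minus test accuracy in percentage points (2 decimals)."""
    return round(train - test, 2)


def gaps_for(protein_class):
    """Map each method to its train-minus-test gap for one protein class."""
    return {
        method: accuracy_gap(train, test)
        for method, (train, test) in TABLE5[protein_class].items()
    }


def largest_gap():
    """Return (protein class, method, gap) with the largest absolute gap."""
    candidates = (
        (protein_class, method, accuracy_gap(train, test))
        for protein_class, methods in TABLE5.items()
        for method, (train, test) in methods.items()
    )
    return max(candidates, key=lambda item: abs(item[2]))


if __name__ == "__main__":
    for protein_class in TABLE5:
        print(protein_class, gaps_for(protein_class))
    print("Largest absolute gap:", largest_gap())
```

A negative gap, as in the Transmembrane row, simply means that the blind test accuracy was higher than the cross-validated training accuracy.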

ISSN: 1477-5956