Table 2. The number of variable-length k-grams and the rate of hash collisions for various hash sizes.

From: Protein sequence classification using feature hashing

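| Value of b | non-plant: # features | non-plant: Collisions % | plant: # features | plant: Collisions % | psortNeg: # features | psortNeg: Collisions % |
|---|---|---|---|---|---|---|
| 2^22 | 155017 | 0 | 111544 | 0 | 124389 | 0 |
| 2^20 | 153166 | 1.21 | 110236 | 1.18 | 122894 | 1.22 |
| 2^19 | 147223 | 5.29 | 107299 | 3.95 | 118871 | 4.64 |
| 2^18 | 132754 | 16.30 | 99913 | 11.43 | 109535 | 13.22 |
| 2^17 | 99764 | 45.04 | 82141 | 31.38 | 87618 | 35.66 |
| 2^16 | 59358 | 78.53 | 53616 | 64.29 | 55555 | 68.85 |
| 2^15 | 32474 | 95.80 | 31788 | 89.56 | 32075 | 92.02 |
| 2^14 | 16384 | 100 | 16384 | 100 | 16384 | 100 |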

  1. The number of unique features (denoted as # features) and the rate of collisions on the non-plant, plant, and psortNeg data sets, respectively, for variable-length k-gram representations, where k varies from 1 to 4.
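
To make the two reported quantities concrete, here is a minimal Python sketch of how such numbers could be computed. It rests on several assumptions that are not stated in this excerpt: the collision rate is taken to be the percentage of occupied hash buckets that receive more than one distinct k-gram, `zlib.crc32` is used as a stand-in hash function, and the function names and toy sequences are hypothetical. It is not the authors' implementation.

```python
import zlib
from collections import defaultdict


def variable_length_kgrams(sequence, max_k=4):
    """Return the set of all distinct k-grams of length 1..max_k in a sequence."""
    return {
        sequence[i:i + k]
        for k in range(1, max_k + 1)
        for i in range(len(sequence) - k + 1)
    }


def hashing_stats(sequences, b, max_k=4):
    """Hash all distinct k-grams into 2**b buckets.

    Returns (number of occupied buckets, collision percentage), where the
    collision percentage is the share of occupied buckets holding more than
    one distinct k-gram (an assumed definition).
    """
    kgrams = set()
    for seq in sequences:
        kgrams |= variable_length_kgrams(seq, max_k)

    buckets = defaultdict(set)
    for gram in kgrams:
        # zlib.crc32 is a stand-in hash (assumption), reduced to b bits.
        buckets[zlib.crc32(gram.encode("ascii")) % (1 << b)].add(gram)

    n_features = len(buckets)
    n_collided = sum(1 for grams in buckets.values() if len(grams) > 1)
    pct = 100.0 * n_collided / n_features if n_features else 0.0
    return n_features, pct


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKVLAAGIVG"]  # made-up sequences, not from the paper
    for b in (22, 8, 4):
        print(b, hashing_stats(toy, b))
```

Under this definition, shrinking b can only merge buckets, so the feature count is capped at 2^b. That matches the last row of the table, where all three data sets saturate at 2^14 = 16384 features with 100% collisions.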