Abstract
In this paper, we introduce a novel feature extraction model for protein sequence comparison. First, we cluster the 20
natural amino acids into 8 groups based on their physicochemical properties using the K-means algorithm. Then, a
36-dimensional feature vector is extracted from the frequency, the mean absolute error of the positions of amino acids in
the reduced amino acid sequences, and the order information of the 20 amino acids in the original sequences. Finally, the
Euclidean distance between feature vectors is used to measure the similarity and evolutionary distance between protein
sequences. Our tests indicate that the method is fast and accurate for classifying proteins and inferring their phylogeny.
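
The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the 8-group partition `GROUPS` is a hypothetical stand-in for the K-means result, the split of the 36 dimensions into 8 frequencies, 8 positional mean absolute errors, and 20 order values is inferred from the counts, and the "order information" is approximated here by the normalized mean position of each amino acid.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical 8-group partition, for illustration only. The paper derives
# its groups with K-means on physicochemical properties; they may differ.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def _positions(symbols, target):
    """Return the 1-based positions at which `target` occurs in `symbols`."""
    return [i + 1 for i, s in enumerate(symbols) if s == target]


def feature_vector(seq):
    """Map a protein sequence to a 36-dimensional vector (assumed layout:
    8 group frequencies + 8 positional MAEs + 20 order values)."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    # Reduced sequence; raises KeyError on non-standard residues.
    reduced = [GROUP_OF[aa] for aa in seq]

    freq, mae = [], []
    for g in range(len(GROUPS)):
        pos = _positions(reduced, g)
        freq.append(len(pos) / n)
        if pos:
            mean = sum(pos) / len(pos)
            # Mean absolute deviation of positions from their mean,
            # divided by n (the normalization is an assumption).
            mae.append(sum(abs(p - mean) for p in pos) / len(pos) / n)
        else:
            mae.append(0.0)

    # Assumed "order information": normalized mean position per amino acid.
    order = []
    for aa in AMINO_ACIDS:
        pos = _positions(seq, aa)
        order.append(sum(pos) / (len(pos) * n) if pos else 0.0)

    return freq + mae + order


def euclidean(u, v):
    """Euclidean distance between two feature vectors (Python 3.8+)."""
    return math.dist(u, v)
```

Under these assumptions, `euclidean(feature_vector(a), feature_vector(b))` gives a pairwise distance that could populate a distance matrix for a standard tree-building method.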