
Abstract

In this paper, we introduce a novel feature extraction model for protein sequence comparison. First, we cluster the 20 natural amino acids into 8 groups based on their physicochemical properties using the K-Means algorithm. A 36-dimensional feature vector is then extracted from three sources: the frequency of each group in the reduced amino acid sequence, the mean absolute error of the positions of amino acids in the reduced sequence, and the order information of the 20 amino acids in the original sequence. Finally, the Euclidean distance between feature vectors is used to measure the similarity and evolutionary distance between protein sequences. Tests indicate that our method is fast and accurate for classifying proteins and inferring their phylogeny.
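The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the 8-group mapping `GROUPS` is hypothetical (the paper derives its grouping with K-Means, which is not reproduced here), the 36 dimensions are assumed to split as 8 group frequencies, 8 positional deviations, and 20 per-residue order values, the positional term is taken as the mean absolute deviation of each group's positions from their mean, and the order term is taken as the mean normalized position of each amino acid. The function names `feature_vector` and `euclidean` are likewise hypothetical.

```python
import math

# Hypothetical 8-group reduction of the 20 amino acids. The paper obtains its
# grouping from K-Means on physicochemical properties; this mapping is only a
# stand-in for illustration.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KR", "H", "GP"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_GROUP = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def positions_by_symbol(symbols, alphabet_size):
    """Collect the 1-based positions at which each symbol index occurs."""
    pos = [[] for _ in range(alphabet_size)]
    for i, s in enumerate(symbols, start=1):
        pos[s].append(i)
    return pos


def feature_vector(seq):
    """Return an assumed 36-dimensional vector (8 + 8 + 20 components)."""
    # Keep only the 20 standard residues.
    seq = [aa for aa in seq.upper() if aa in AA_TO_GROUP]
    n = len(seq)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")

    reduced = [AA_TO_GROUP[aa] for aa in seq]
    group_pos = positions_by_symbol(reduced, len(GROUPS))
    aa_pos = positions_by_symbol([AMINO_ACIDS.index(aa) for aa in seq], 20)

    # (1) Frequency of each group in the reduced sequence.
    freq = [len(p) / n for p in group_pos]

    # (2) Assumed positional term: mean absolute deviation of each group's
    #     positions from their mean, normalized by sequence length.
    mad = []
    for p in group_pos:
        if p:
            mean = sum(p) / len(p)
            mad.append(sum(abs(x - mean) for x in p) / (len(p) * n))
        else:
            mad.append(0.0)

    # (3) Assumed order term: mean normalized position of each amino acid
    #     in the original sequence.
    order = [sum(p) / (len(p) * n) if p else 0.0 for p in aa_pos]

    return freq + mad + order


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, two sequences are compared by calling `euclidean(feature_vector(a), feature_vector(b))`, and a matrix of such pairwise distances could be passed to a standard tree-building method for phylogenetic inference. Because every component is normalized by sequence length, sequences of different lengths map to vectors on a comparable scale.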