Background: A large number of alignment–free techniques of graphical representation and numerical characterization (GRANCH) of bio-molecular sequences have been proposed in the recent past years, but the relative efficacy of these methods in determining the degree of similarities and dissimilarities of such sequences have not been ascertained. Objective: Our objective is to make an assessment of the relative efficacy of these methods in determining the degree of similarities and dissimilarities of bio-molecular sequences. Method: We have chosen 7 published/communicated methods that represent various classes of GRANCH techniques and computed the descriptors that are expected to characterize similarities and dissimilarities in several sets of gene sequences. We critically appraise the different methods and determine which of these yield non-redundant structural information that could be used to compute different properties of the sequences, and which are correlated enough to one another so that using the simplest representative of the group would suffice. We also do a principal component analysis (PCA) to determine how the variances in the calculated sequence descriptors are explained by the computed principal components (PCs). Results: We found that some of the descriptors are strongly correlated implying a commonality of structural information encoded by them while others are distinctly separate. The PCA results show that the first three PC’s explain >97% of the variances. Conclusion: We found that some mathematical DNA descriptors calculated by a few of these techniques correlate strongly with one another implying a redundancy in the structural information quantified by those descriptors; others are not strongly correlated with one another suggesting that they encode non-redundant sequence information. From this and our PCA results, our recommendation would be to use minimally correlated set of descriptors or orthogonal descriptors like PCs derived from the descriptor set for the characterization of nucleic acid structure and function.
- DNA descriptors
- Descriptor correlations
- Gene sequences
- Graphical representation and numerical characterization (GRANCH) techniques
- Principal component (PC)
- Principal components analysis (PCA)