SUMAC: Constructing phylogenetic supermatrices and assessing partially decisive taxon coverage

William A. Freyman

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

The amount of phylogenetically informative sequence data in GenBank is growing at an exponential rate, and large phylogenetic trees are increasingly used in research. Tools are needed to construct phylogenetic sequence matrices from GenBank data and evaluate the effect of missing data. Supermatrix Constructor (SUMAC) is a tool to data-mine GenBank, construct phylogenetic supermatrices, and assess the phylogenetic decisiveness of a matrix given the pattern of missing sequence data. SUMAC calculates a novel metric, Missing Sequence Decisiveness Scores (MSDS), which measures how much each individual missing sequence contributes to the decisiveness of the matrix. MSDS can be used to compare supermatrices and prioritize the acquisition of new sequence data. SUMAC constructs supermatrices either through an exploratory clustering of all GenBank sequences within a taxonomic group or by using guide sequences to build homologous clusters in a more targeted manner. SUMAC assembles supermatrices for any taxonomic group recognized in GenBank and is optimized to run on multicore computer systems by parallelizing multiple stages of operation. SUMAC is implemented as a Python package that can run as a stand-alone command-line program, or its modules and objects can be incorporated within other programs. SUMAC is released under the open source GPLv3 license and is available at https://github.com/wf8/sumac.
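
The idea behind MSDS, as the abstract describes it, can be illustrated with a minimal sketch. This is not SUMAC's implementation or API. It assumes, for illustration only, that decisiveness is approximated by the fraction of taxon triples that co-occur in at least one locus, and it scores each missing sequence by how much that fraction would rise if the sequence were obtained. The function names and the presence/absence matrix layout are hypothetical.

```python
from itertools import combinations


def covered_fraction(matrix, k=3):
    """Fraction of k-taxon subsets fully sequenced in at least one locus.

    `matrix` maps taxon name -> list of booleans, one per locus
    (True = sequence present). This layout is an assumption made
    for the sketch, not SUMAC's data model.
    """
    taxa = sorted(matrix)
    subsets = list(combinations(taxa, k))
    if not subsets:
        return 0.0
    n_loci = len(matrix[taxa[0]])
    covered = sum(
        1
        for subset in subsets
        if any(all(matrix[t][j] for t in subset) for j in range(n_loci))
    )
    return covered / len(subsets)


def missing_sequence_scores(matrix, k=3):
    """Gain in covered fraction from filling each missing cell on its own."""
    base = covered_fraction(matrix, k)
    scores = {}
    for taxon, row in matrix.items():
        for locus, present in enumerate(row):
            if present:
                continue
            # Copy the matrix and pretend this one sequence was acquired.
            filled = {t: list(r) for t, r in matrix.items()}
            filled[taxon][locus] = True
            scores[(taxon, locus)] = covered_fraction(filled, k) - base
    return scores


# Toy matrix: four taxa, two loci.
example = {
    "A": [True, True],
    "B": [True, True],
    "C": [True, False],
    "D": [False, True],
}
coverage = covered_fraction(example)        # 0.5
gains = missing_sequence_scores(example)    # each missing cell gains 0.5
```

Ranking the missing cells by their gain gives a simple priority list for new sequencing under these assumptions. The scoring is brute force and recomputes coverage once per missing cell, so it only suits small matrices.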

Original language: English (US)
Pages (from-to): 263-266
Number of pages: 4
Journal: Evolutionary Bioinformatics
Volume: 11
State: Published - Nov 30 2015

Bibliographical note

Publisher Copyright:
© the authors, publisher and licensee Libertas Academica Limited.

Keywords

  • Data-mining
  • Decisiveness
  • GenBank
  • Partial taxon coverage
  • Phylogenetics
  • Supermatrix
