The amount of phylogenetically informative sequence data in GenBank is growing at an exponential rate, and large phylogenetic trees are increasingly used in research. Tools are needed to construct phylogenetic sequence matrices from GenBank data and evaluate the effect of missing data. Supermatrix Constructor (SUMAC) is a tool to data-mine GenBank, construct phylogenetic supermatrices, and assess the phylogenetic decisiveness of a matrix given the pattern of missing sequence data. SUMAC calculates a novel metric, Missing Sequence Decisiveness Scores (MSDS), which measures how much each individual missing sequence contributes to the decisiveness of the matrix. MSDS can be used to compare supermatrices and prioritize the acquisition of new sequence data. SUMAC constructs supermatrices either through an exploratory clustering of all GenBank sequences within a taxonomic group or by using guide sequences to build homologous clusters in a more targeted manner. SUMAC assembles supermatrices for any taxonomic group recognized in GenBank and is optimized to run on multicore computer systems by parallelizing multiple stages of operation. SUMAC is implemented as a Python package that can run as a stand-alone command-line program, or its modules and objects can be incorporated within other programs. SUMAC is released under the open source GPLv3 license and is available at https://github.com/wf8/sumac.
Bibliographical notePublisher Copyright:
© the authors, publisher and licensee Libertas Academica Limited.
- Partial taxon coverage