Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)?

Soowon Cho; Andreas Zwick; Jerome C. Regier; Charles Mitter; Michael P. Cummings; Jianxiu Yao; Zaile Du; Hong Zhao; Akito Y. Kawahara; Susan Weller; Donald R. Davis; Joaquin Baixeras; John W. Brown; Cynthia Parr

doi:10.1093/sysbio/syr079

Bell Museum of Natural History

Research output: Contribution to journal › Article › peer-review

84 Scopus citations

Abstract

This paper addresses the question of whether one can economically improve the robustness of a molecular phylogeny estimate by increasing gene sampling in only a subset of taxa, without having the analysis invalidated by artifacts arising from large blocks of missing data. Our case study stems from an ongoing effort to resolve poorly understood deeper relationships in the large clade Ditrysia (>150,000 species) of the insect order Lepidoptera (butterflies and moths). Seeking to remedy the overall weak support for deeper divergences in an initial study based on five nuclear genes (6.6 kb) in 123 exemplars, we nearly tripled the total gene sample (to 26 genes, 18.4 kb) but only in a third (41) of the taxa. The resulting partially augmented data matrix (45% intentionally missing data) consistently increased bootstrap support for groupings previously identified in the five-gene (nearly) complete matrix, while introducing no contradictory groupings of the kind that missing data have been predicted to produce. Our results add to growing evidence that data sets differing substantially in gene and taxon sampling can often be safely and profitably combined. The strongest overall support for nodes above the family level came from including all nucleotide changes, while partitioning sites into sets undergoing mostly nonsynonymous versus mostly synonymous change. In contrast, support for the deepest node for which any persuasive molecular evidence has yet emerged (78-85% bootstrap) was weak or nonexistent unless synonymous change was entirely excluded, a result plausibly attributed to compositional heterogeneity. This node (Gelechioidea + Apoditrysia), tentatively proposed by previous authors on the basis of four morphological synapomorphies, is the first major subset of ditrysian superfamilies to receive strong statistical support in any phylogenetic study. A "more-genes-only" data set (41 taxa × 26 genes) also gave strong signal for a second deep grouping (Macrolepidoptera) that was obscured, but not strongly contradicted, in more taxon-rich analyses.

Original language	English (US)
Pages (from-to)	782-796
Number of pages	15
Journal	Systematic Biology
Volume	60
Issue number	6
DOIs	https://doi.org/10.1093/sysbio/syr079
State	Published - Dec 2011

Bibliographical note

Funding Information:
Financial support was provided by the US National Science Foundation’s Assembling the Tree of Life program, award numbers 0531626 and 0531769; the Spanish Government (Ministerio de Ciencia e Innovación) (CGL2008-00605 to J.B.); and the US National Science Foundation (DEB 0515699 to D. H. Janzen).

Keywords

Ditrysia
Hexapoda
Lepidoptera
gene sampling
missing data
molecular phylogenetics
nuclear genes
taxon sampling

Cite this

Cho, S., Zwick, A., Regier, J. C., Mitter, C., Cummings, M. P., Yao, J., Du, Z., Zhao, H., Kawahara, A. Y., Weller, S., Davis, D. R., Baixeras, J., Brown, J. W., & Parr, C. (2011). Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)? Systematic Biology, 60(6), 782-796. https://doi.org/10.1093/sysbio/syr079

Cho, S, Zwick, A, Regier, JC, Mitter, C, Cummings, MP, Yao, J, Du, Z, Zhao, H, Kawahara, AY, Weller, S, Davis, DR, Baixeras, J, Brown, JW & Parr, C 2011, 'Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)?', Systematic Biology, vol. 60, no. 6, pp. 782-796. https://doi.org/10.1093/sysbio/syr079

@article{e016857b373544ee81e949946030b92b,

title = "Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)?",

abstract = "This paper addresses the question of whether one can economically improve the robustness of a molecular phylogeny estimate by increasing gene sampling in only a subset of taxa, without having the analysis invalidated by artifacts arising from large blocks of missing data. Our case study stems from an ongoing effort to resolve poorly understood deeper relationships in the large clade Ditrysia (>150,000 species) of the insect order Lepidoptera (butterflies and moths). Seeking to remedy the overall weak support for deeper divergences in an initial study based on five nuclear genes (6.6 kb) in 123 exemplars, we nearly tripled the total gene sample (to 26 genes, 18.4 kb) but only in a third (41) of the taxa. The resulting partially augmented data matrix (45% intentionally missing data) consistently increased bootstrap support for groupings previously identified in the five-gene (nearly) complete matrix, while introducing no contradictory groupings of the kind that missing data have been predicted to produce. Our results add to growing evidence that data sets differing substantially in gene and taxon sampling can often be safely and profitably combined. The strongest overall support for nodes above the family level came from including all nucleotide changes, while partitioning sites into sets undergoing mostly nonsynonymous versus mostly synonymous change. In contrast, support for the deepest node for which any persuasive molecular evidence has yet emerged (78-85% bootstrap) was weak or nonexistent unless synonymous change was entirely excluded, a result plausibly attributed to compositional heterogeneity. This node (Gelechioidea + Apoditrysia), tentatively proposed by previous authors on the basis of four morphological synapomorphies, is the first major subset of ditrysian superfamilies to receive strong statistical support in any phylogenetic study. A {"}more-genes-only{"} data set (41 taxa × 26 genes) also gave strong signal for a second deep grouping (Macrolepidoptera) that was obscured, but not strongly contradicted, in more taxon-rich analyses.",

keywords = "Ditrysia, Hexapoda, Lepidoptera, gene sampling, missing data, molecular phylogenetics, nuclear genes, taxon sampling",

author = "Soowon Cho and Andreas Zwick and Regier, {Jerome C.} and Charles Mitter and Cummings, {Michael P.} and Jianxiu Yao and Zaile Du and Hong Zhao and Kawahara, {Akito Y.} and Susan Weller and Davis, {Donald R.} and Joaquin Baixeras and Brown, {John W.} and Cynthia Parr",

note = "Funding Information: Financial support was provided by the US National Science Foundation{\textquoteright}s Assembling the Tree of Life program, award numbers 0531626 and 0531769; the Spanish Government (Ministerio de Ciencia e Innovaci{\'o}n) (CGL2008-00605 to J.B.); and the US National Science Foundation (DEB 0515699 to D. H. Janzen).",

year = "2011",

month = dec,

doi = "10.1093/sysbio/syr079",

language = "English (US)",

volume = "60",

pages = "782--796",

journal = "Systematic Biology",

issn = "1063-5157",