Pitfalls in supermatrix phylogenomics

  • Hervé Philippe Centre de Théorisation et de Modélisation de la Biodiversité, Station d’Ecologie Théorique et Expérimentale, UMR CNRS 5321, 09200 Moulis
  • Damien M. de Vienne Laboratoire de Biométrie et Biologie Evolutive, CNRS, UMR 5558, Université Lyon 1, 69622 Villeurbanne
  • Vincent Ranwez SupAgro, UMR AGAP, 34398 Montpellier
  • Béatrice Roure Centre de Théorisation et de Modélisation de la Biodiversité, Station d’Ecologie Théorique et Expérimentale, UMR CNRS 5321, 09200 Moulis
  • Denis Baurain InBioS-PhytoSYSTEMS – Eukaryotic Phylogenomics, Université de Liège, Liège
  • Frédéric Delsuc Institut des Sciences de l’Evolution, UMR 5554, CNRS, IRD, EPHE, Université de Montpellier, Montpellier
Keywords: phylogenomics, supermatrix, systematic error, data quality, incongruence

Abstract

In the mid-2000s, molecular phylogenetics turned into phylogenomics, a development that improved the resolution of phylogenetic trees through a dramatic reduction in stochastic error. While some then predicted “the end of incongruence”, it soon appeared that analysing large amounts of sequence data without an adequate model of sequence evolution amplifies systematic error and leads to phylogenetic artefacts. With the increasing flood of (sometimes low-quality) genomic data resulting from the rise of high-throughput sequencing, a new type of error has emerged. Termed here “data errors”, it lumps together several kinds of issues affecting the construction of phylogenomic supermatrices (e.g., sequencing and annotation errors, contaminant sequences). While easy to deal with at a single-gene scale, such errors become very difficult to avoid at the genomic scale, both because hand curating thousands of sequences is prohibitively time-consuming and because the suitable automated bioinformatics tools are still in their infancy. In this paper, we first review the pitfalls affecting the construction of supermatrices and the strategies to limit their adverse effects on phylogenomic inference. Then, after discussing the relative non-issue of missing data in supermatrices, we briefly present the approaches commonly used to reduce systematic error.

Published
2017-02-21
How to Cite
Philippe, H., Vienne, D. M. de, Ranwez, V., Roure, B., Baurain, D., & Delsuc, F. (2017). Pitfalls in supermatrix phylogenomics. European Journal of Taxonomy, (283). https://doi.org/10.5852/ejt.2017.283