Symposium proposal
Organizer: Xing-Xing Shen (Zhejiang University, China)
Co-organizer: Xiaofan Zhou (South China Agricultural University, China)
With Next Generation Sequencing (NGS) data now routinely used, phylogenetics is entering the era of genomics (i.e., phylogenomics). Phylogenomic data matrices provide unprecedented resolution of the tree of life, but inferring a reliable species phylogeny from a large dataset with hundreds of orthologous genes and tens to hundreds of species involves many challenges, including uncertainty in maximum likelihood tree search and gross error from model misspecification. In addition, accurately assessing support and uncertainty across branches is almost as important as building the species phylogeny. The phylogenetic analysis of SARS-CoV-2 is a prime example of these challenges: it has become increasingly difficult to keep the phylogeny up to date with the thousands of new genomes sequenced every day, let alone to properly measure its uncertainty. To our knowledge, no symposium has yet made a concerted effort to discuss building the species phylogeny accurately together with assessing support across branches. In this symposium, we will bring together researchers from around the world who have been developing new approaches to inferring species phylogeny and assessing phylogenetic support in the era of genomics. We aim to discuss the relative merits and drawbacks of these approaches and to highlight common themes.
S1-1
Phylogenomic inference of protein sequences with IQ-TREE 2
Minh Bui1
1Australian National University
Genome-scale data have now become routine in evolutionary studies, but the amount and complexity of the data challenge our phylogenetic methods. In this talk, I will highlight recent advances in models and methods for phylogenomic inference and their implementations in IQ-TREE 2. A particular focus will be QMaker, a new tool for estimating amino acid substitution models that improve inference from protein sequences for animals, plants, birds, insects, yeasts, and other taxa.
S1-2
Probing the resolution of quartet molecular phylogenies by deep neural networks
Zhengting Zou1, Hongjiu Zhang2, Jianzhi Zhang3
1Institute of Zoology, Chinese Academy of Sciences, Beijing, China
2Microsoft, Inc., USA
3Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, USA
Phylogenetic inference based on molecular sequences has become a fundamental and routine task in evolutionary and other biological studies. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods can fail due to model misspecification or insufficiency, especially when there are heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees by deep residual neural networks, a machine learning approach that requires no explicit model of the sequence evolution process and has succeeded in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods, such as the maximum likelihood method, on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl).
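As a rough illustration of the quartet setting described above: four taxa admit exactly three unrooted, fully resolved topologies, so inferring a quartet tree amounts to a three-way choice. The Python sketch below only enumerates those three candidate splits. The function name `quartet_topologies` and the taxon labels are hypothetical and are not taken from PhyDL, and no neural network is involved.

```python
def quartet_topologies(taxa):
    """Return the three unrooted, fully resolved topologies for four taxa.

    Each topology is written as a split, i.e. a pair of taxon pairs:
    (("A", "B"), ("C", "D")) stands for the tree AB|CD.
    """
    if len(set(taxa)) != 4:
        raise ValueError("exactly four distinct taxa are required")
    a, b, c, d = taxa
    return [
        ((a, b), (c, d)),  # AB|CD
        ((a, c), (b, d)),  # AC|BD
        ((a, d), (b, c)),  # AD|BC
    ]


if __name__ == "__main__":
    # Hypothetical taxon labels, used only for illustration.
    for left, right in quartet_topologies(["A", "B", "C", "D"]):
        print("".join(left) + "|" + "".join(right))
```

A quartet classifier, whatever its internals, would then pick one of these three splits for each set of four aligned sequences.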
S1-3
Confidence and truth in phylogenomics
Rob Lanfear1
1Australian National University, Canberra
Phylogenies form the backbone of our understanding of the tree of life, and are crucial for understanding and tracking emerging diseases. Accurately measuring and communicating uncertainty in phylogenies is almost as important as building the phylogeny itself. Using the right measures of uncertainty can help avoid meaningless arguments, and in the case of emerging diseases can help make the right public health decisions and avoid the wrong ones.
In this talk I'll introduce a range of methods for measuring and communicating uncertainty in phylogenetics (bootstraps, rootstraps, concordance factors, and branch parsimony scores), and illustrate how and why each should be used with examples from estimating the tree of life to the genomic epidemiology of SARS-CoV-2.
S1-4
Building a reliable phylogeny of a very large collection of SARS-CoV-2 genomes
Marcos Caraballo-Ortiz1, Sayaka Miura1, Sudhir Kumar1
1Temple University
Building reliable phylogenies from very large collections of sequences with low variation and significant sequencing error has been challenging. Sequencing error interferes with the phylogenetic signal and makes robust phylogenetic inference difficult, a problem that is well documented for SARS-CoV-2. Massive global sequencing of SARS-CoV-2 genomes has produced very tall sequence alignments, as the number of phylogenetically informative positions is orders of magnitude smaller than the number of sequences. These data also suffer from significant sequencing errors. We show that the use of high-frequency strain haplotypes harboring common variants improves the signal-to-noise ratio and can produce a robust phylogeny. We apply this TopStrains method to build a well-resolved phylogeny of more than 300,000 SARS-CoV-2 genomes, in which key episodes of SARS-CoV-2 evolution are highly supported in a genome-resampling bootstrap test. The root of the SARS-CoV-2 phylogeny, the most recent common ancestor sequence, and the orientation of mutational changes in the phylogeny of major strains are the same as those produced by an independent mutation order analysis. TopStrains is computationally efficient and scales gracefully with the increasing size of the genome collection.
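The idea of restricting the analysis to high-frequency haplotypes built from common variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TopStrains implementation: the function name `common_variant_haplotypes`, both thresholds, and the toy sequences are hypothetical, the genomes are assumed to be pre-aligned to a reference of equal length, and gaps and ambiguity codes are not handled.

```python
from collections import Counter


def common_variant_haplotypes(genomes, reference,
                              min_variant_freq=0.01,
                              min_haplotype_count=2):
    """Collapse aligned genomes into haplotypes of common variants.

    A variant is a (position, base) pair that differs from the reference.
    Variants seen in fewer than `min_variant_freq` of genomes are ignored
    (treated as likely noise), and haplotypes carried by fewer than
    `min_haplotype_count` genomes are dropped. Both thresholds are
    illustrative assumptions.
    """
    n = len(genomes)
    if n == 0:
        return {}

    # Count how many genomes carry each non-reference allele.
    variant_counts = Counter()
    for seq in genomes:
        for pos, (base, ref) in enumerate(zip(seq, reference)):
            if base != ref:
                variant_counts[(pos, base)] += 1

    common = {v for v, c in variant_counts.items()
              if c / n >= min_variant_freq}

    # Describe each genome by the common variants it carries.
    haplotype_counts = Counter()
    for seq in genomes:
        hap = tuple((pos, base)
                    for pos, (base, ref) in enumerate(zip(seq, reference))
                    if base != ref and (pos, base) in common)
        haplotype_counts[hap] += 1

    return {h: c for h, c in haplotype_counts.items()
            if c >= min_haplotype_count}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    ref = "AAAA"
    seqs = ["AAAA", "AAAA", "ACAA", "ACAA", "ACAG", "AAAT"]
    print(common_variant_haplotypes(seqs, ref, min_variant_freq=0.3))
```

In the toy data, the singleton changes at the last position fall below the variant threshold, so the genomes carrying them are folded into the two well-supported haplotypes instead of forming their own.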
S1-5
Evaluating Machine-Learning Based Phylogenetic Programs Using Simulated Data Sets
Yixiao Zhu1, Chuhao Li2, Xing-Xing Shen1, Xiaofan Zhou2
1Zhejiang University
2South China Agricultural University
Phylogenetic trees are essential for studying biology. Owing to its high speed and freedom from explicit substitution models, machine learning has recently been applied to infer phylogenetic trees from multiple sequence alignments and has shown promising performance. However, as these machine-learning programs all rely on simulated training data, their performance on datasets whose properties are not covered in the training data remains unexplored. Here, we assessed the accuracy of machine learning-based and traditional maximum likelihood-based phylogenetic programs on datasets simulated under long-branch attraction and long-branch repulsion conditions. Our results show that machine learning-based programs are more likely to make incorrect inferences than maximum likelihood-based programs, particularly when the tested dataset differs greatly from the training data. These results might provide insights to facilitate future development of machine learning-based phylogenetic methods.