Identification of sources of errors in molecular phylogenetic analyses
Probabilistic tree reconstruction methods are considered very robust as long as model violations are not too extensive. However, it is unclear what "too extensive" means, and doubts can be raised as to whether even the most robust tree inference method, the maximum likelihood (ML) approach, delivers reliable results with empirical data.
All analyses of the project were performed and evaluated with LoBraTe (Long Branch Test Pipeline), a self-developed Perl pipeline for sequence simulation, phylogenetic reconstruction and evaluation.
Part I - Testing the robustness and efficiency of ML
The first part of our project analyzes the robustness and efficiency of ML under different degrees of model misspecification on multiple-taxon trees. We start from simulated data and a broad range of branch length conditions. Our results show that the risk of obtaining a wrong topology with ML depends on the arrangement of the edges, especially if the underlying data evolved under strong branch length heterogeneity among closely related short and long internal branches. In that case, ML cannot recover the true tree even with long alignments (100,000 bp) and only minor model misspecification. This class of topologies has not been investigated before and constitutes a new example for which ML efficiency is low (Kück et al. 2012: Long branch effects distort maximum likelihood phylogenies in simulations despite selection of the correct model; Kück et al. 2014a: Systematic errors in maximum-likelihood tree inference; Kück et al. 2014b: Topological bias of maximum-likelihood trees inferred from star phylogenies in the event of correct and incorrect model assumptions).
ML tree reconstruction success on simulated multiple-taxon trees (100 kbp) in which two closely related internal branches (BL2, red) were increased stepwise (x-axis) while the innermost branch (blue) was kept constantly short. For each branch length condition, 100 data sets were simulated and analysed (y-axis). The number of reconstructions with an incorrect sister-group relationship of sequences L5 and L6 (named LBA class I, sensu Wägele & Mayer 2007) is shown as the red plot line, the number of correct ML tree reconstructions as the blue plot line. The only model misspecification was the use of four substitution rate categories in the tree reconstruction process instead of the continuous gamma distribution used for the data simulations (Kück et al. 2012: Long branch effects distort maximum likelihood phylogenies in simulations despite selection of the correct model).
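The misspecification described above, four discrete rate categories in place of a continuous gamma distribution, can be illustrated with a minimal sketch. It approximates equal-probability rate categories by sampling from a gamma distribution with mean 1 and averaging within each bin. The function name `discrete_gamma_rates`, the shape value `alpha=0.5`, the sample size, and the Monte Carlo approach are assumptions made for illustration; they are not taken from LoBraTe or from the tree reconstruction software used in the study.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_samples=200_000, seed=1):
    """Approximate equal-probability discrete gamma rate categories.

    Rates are drawn from a gamma distribution with shape `alpha` and mean 1,
    sorted, and cut into `n_categories` equally sized bins. Each category is
    represented by the mean rate of its bin (a Monte Carlo approximation).
    """
    rng = random.Random(seed)
    # gammavariate(alpha, beta) has mean alpha * beta, so beta = 1 / alpha gives mean 1.
    rates = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    bin_size = n_samples // n_categories
    return [
        sum(rates[i * bin_size:(i + 1) * bin_size]) / bin_size
        for i in range(n_categories)
    ]


# Hypothetical shape parameter, chosen only for illustration.
categories = discrete_gamma_rates(alpha=0.5)
for index, rate in enumerate(categories, start=1):
    print(f"category {index}: relative rate {rate:.3f}")
```

Every site in a discrete model is assigned one of these few rates, whereas sites simulated under the continuous distribution can take any positive rate. The gap between the two is the kind of minor model violation referred to in the caption.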
Altogether, the high confidence placed in trees inferred with probabilistic methods such as ML is not always justified: under a certain range of tree shape conditions, ML can fail even if alignment lengths reach 100,000 bp.
Part II - Identification of sources of error
Searching for sources of error, we focus on character polarity, a feature which is usually thought to be irrelevant in ML. Mechanisms that lead to wrong tree topologies were analysed at the level of split-supporting site patterns. In simulations, plesiomorphic site patterns can be identified by comparison with known root sequences. These patterns cause some surprising effects: Using data sets generated with simulations of sequence evolution along a variety of topologies and inferring trees using the same (correct) model, we show for cases of branch-length heterogeneity that (i) as already known, ML analyses can fail to recover the correct tree even when the correct substitution model is used, but also that (ii) plesiomorphic character states cause substantial errors and therefore character polarity is relevant, and (iii) chance similarities accumulating on long branches are far less misleading than plesiomorphic states accumulating on shorter branches. The artefacts occur when branch lengths are heterogeneous. The systematic errors disappear for the most part when the sites with symplesiomorphies supporting false clades are deleted from the data set. We conclude that many of the phylogenies published during the past decades may be false due to the neglected effects of symplesiomorphies (Kück & Wägele 2015: Plesiomorphic character states cause systematic errors in molecular phylogenetic analyses: a simulation study).
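The comparison with a known root sequence and the deletion of misleading sites can be sketched as follows. This is a simplified illustration, not the LoBraTe implementation: the function names, the toy alignment, and the strict criterion (all group members keep the root state, all other taxa show a derived state) are assumptions introduced here.

```python
def symplesiomorphic_sites(alignment, root, group):
    """Return 0-based columns in which every taxon in `group` shows the root
    state while every other taxon shows a different (derived) state.

    Assumed, simplified criterion for a symplesiomorphy that supports
    `group` as a (false) clade.
    """
    others = [name for name in alignment if name not in group]
    sites = []
    for i, root_state in enumerate(root):
        group_keeps_root = all(alignment[name][i] == root_state for name in group)
        others_derived = all(alignment[name][i] != root_state for name in others)
        if group_keeps_root and others_derived:
            sites.append(i)
    return sites


def remove_sites(alignment, sites):
    """Return a copy of the alignment without the given columns."""
    drop = set(sites)
    return {
        name: "".join(state for i, state in enumerate(seq) if i not in drop)
        for name, seq in alignment.items()
    }


# Toy data: hypothetical root sequence and four simulated taxa.
root = "ACGTAC"
alignment = {
    "S1": "ACGTTC",
    "S2": "ACGTAC",
    "L1": "GCTTAC",
    "L2": "GCATAC",
}

sites = symplesiomorphic_sites(alignment, root, group={"S1", "S2"})
filtered = remove_sites(alignment, sites)
print(sites)
print(filtered)
```

A check of this kind is possible only in simulations, because it requires the true root sequence.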
(a) Example of a rooted tree with 11 taxa and elongated inner branches (red) used for simulations. (b) Accumulation of symplesiomorphies for the paraphyletic short-branch clade (S1, S2). Y-axis: proportion of alignment sites. X-axis: branch length of elongated branches. (c) and (d) ML reconstruction success with (c) and without (d) sites that contain symplesiomorphies supporting the paraphyletic group (S1, S2). Blue curve: frequency of inferred correct trees (without long-branch artefacts). Red curve: frequency of topologies containing the false clade (S1, S2). Brown curve: other topologies.
In our study, all situations that produce artefacts based on symplesiomorphies are cases with branch length heterogeneity. The attraction caused by symplesiomorphies can group short branches when they occur in the same vicinity of the tree. We tested topologies with long terminal branches and with elongated inner edges. It seems that the position of the longest branches in the topology is not relevant.
Schematic overview showing an accumulation of plesiomorphic characters (P), finally shared only between the two short terminal branches (SPS). The shorter the innermost branch compared to the preceding ancestral branch, the lower the ratio of plesiomorphic substitutions (A1) to the number of remaining P (evolved along the ancestral branch). Further, the stronger the terminal branch length heterogeneity between sister taxa in the following branching event, the stronger the negative impact on tree reconstruction of SPS accumulated in slowly evolving, short terminal branches, compared with the impact of random sequence similarities (RSS) evolved along long terminal branches.
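A rough numerical intuition for why shared plesiomorphies on short branches can outnumber chance similarities on long branches is sketched below. It assumes the Jukes-Cantor (JC69) substitution model and two branches descending from a single ancestor with a known state. The model, the function names, and the branch lengths are illustrative assumptions and do not reproduce the simulation settings of the study.

```python
import math


def p_same(t):
    """JC69 probability that a site shows the ancestral state after a branch
    of length t (expected substitutions per site)."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def p_specific_other(t):
    """JC69 probability of ending in one specific non-ancestral state."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)


def shared_plesiomorphy(t1, t2):
    """Probability that both descendants show the ancestral state."""
    return p_same(t1) * p_same(t2)


def shared_chance_similarity(t1, t2):
    """Probability that both descendants show the same derived state
    (three possible derived states per site)."""
    return 3.0 * p_specific_other(t1) * p_specific_other(t2)


# Hypothetical branch lengths: two short and two long terminal branches.
short, long_branch = 0.05, 1.0
print(f"short branches sharing the ancestral state: {shared_plesiomorphy(short, short):.3f}")
print(f"long branches sharing a derived state: {shared_chance_similarity(long_branch, long_branch):.3f}")
```

Under these assumptions, two short branches share the retained ancestral state at a much larger proportion of sites than two long branches share a derived state by chance. This matches the direction of the effect described in the caption, although it says nothing about how ML weights such sites.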
In the real world this could mean that slowly evolving taxa may form a seemingly monophyletic collecting bin, with fast-evolving taxa in false sister groups. To study this effect, we still need algorithms that allow site-pattern analyses in empirical data.