Tuesday, August 29, 2017

More non-treelike data forced into trees: a glimpse into the dinosaurs

Plant morphological data sets including fossil taxa can be riddled with incompatible data patterns (e.g. see my first post), and this can be a bit mind-blowing when it comes to tracing evolution over time. So, let’s move on to something potentially simpler: extinct groups of animals.

Until a time-machine is invented, phylogenetic hypotheses for groups such as the many extinct lineages of dinosaurs will have to be based on morphological data sets. Dinosaur fossils are nowhere near as frequent as plant fossils (often isolated organs); but when a complete or partial skeleton is found, this specimen allows scoring more characters than is possible for even a higher-level composite plant taxon. For instance, the largest (character-wise) plant data matrices, using composite taxa and operating at the level of genera and above, including fossils, have a little over 100 characters, whereas dinosaur matrices like the one used by Tschopp, Mateus & Benson (2015) can have several hundred characters.

Classification of dinosaurs tries to apply the principles of ‘cladistics’ (see also http://tolweb.org/Dinosauria), a classification system established by Hennig (1950). Cladistic classification – Hennig did not propose any inference framework – aims to identify exclusively shared derived traits (synapomorphies), and consequently groups of taxa (originally species) that share an inclusive common origin, Hennig's “monophyla”. [In contrast to Haeckel’s (1866) concept of monophyletic groups, which just assumed a common origin, but did not require inclusiveness.] For some reason, which seems to have no scientific basis, but can be understood in a historical context (Felsenstein 2001, 2004: chapter 10), cladistics has been synonymised with parsimony analysis, one of the optimality criteria to infer one-dimensional graphs reflecting a series of dichotomous splits (phylogenetic trees). A basic assumption of cladistic studies is that a clade in a parsimony-inferred tree equals a monophylum (which is not necessarily the case, see e.g. Scotland & Steel 2015 for binary data).

In palaeontology (and systematic biology to some degree) it is common not to show a phylogram, a phylogenetic tree with branch-lengths, but a cladogram. These cladograms rarely depict the optimised tree (or one of the equally optimal trees), but instead show the strict consensus tree of the equally parsimonious trees that were found (the potentially most-parsimonious trees, MPTs). This is also the case for the study by Tschopp et al., used here as an example of the generally non-treelike data used in studies dealing with extinct groups of animals.

David provided a list of questions for exploratory data analysis (EDA), which can (and should) be asked when trying to infer phylogenies based on morphological data. I will look at some of them here.

First question: Are the data tree-like?

The data matrix of Tschopp et al. is impressive (much like the paper itself, with its 298 pages). The authors scored 477 characters (243 new) for (a final set of) 81 “operational taxonomic units” (OTUs). The OTUs are typically specimens in the case of the ingroup, and include several outgroup species for rooting the phylogenetic tree. There are lots of gaps in the matrix (65% missing data), which relates to the inclusion of poorly known fossil specimens, which the authors tried to classify using parsimony inference and pairwise distances. The authors note (p. 163): “Given the low consistency index (CI) and thus high number of homoplasies in the dataset, an additional analysis with the same settings was conducted using implied weighting (iw).” In addition to signal ambiguity related to general homoplasy and ontogeny, the authors note character overlap effects and deformation (pp. 166ff). So, there are quite a few different sources of incompatible, non-treelike signal.

With equal weighting and including all 81 OTUs, the authors ended up with 60,000 equally parsimonious trees (possibly more; this was the maximum number allowed by computational constraints). This produced a strict consensus (SC) tree with just 12 nodes, in which “all ingroup specimens formed one large polytomy”. ‘Implied weighting’ led to a slightly more resolved SC tree. ‘Implied weighting’ is an a posteriori means to down-weight characters conflicting with the inferred tree. The authors further identified some (4, 8, or 15) OTUs accounting for most of the “instability”. A posteriori filtering of these putative rogue taxa led to SC trees that were much better resolved (Fig. 1).
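To make the down-weighting idea a bit more concrete, here is a minimal sketch of the concave fit function usually associated with implied weighting, f = k / (k + e), where e is the number of extra (homoplastic) steps a character needs on a tree and k is a concavity constant. The function names, the value k = 3 and the step counts are assumptions chosen for illustration; they are not the settings or results of Tschopp et al.

```python
def implied_weight_fit(extra_steps, k=3.0):
    """Fit of one character: 1.0 without homoplasy, approaching 0 as extra steps grow."""
    return k / (k + extra_steps)


def total_fit(extra_steps_per_character, k=3.0):
    """Sum of character fits for one tree; a higher total is preferred."""
    return sum(implied_weight_fit(e, k) for e in extra_steps_per_character)


# Hypothetical extra steps of four characters on two candidate trees.
tree_a = [0, 0, 0, 6]  # 6 extra steps, all in one highly homoplastic character
tree_b = [2, 2, 1, 0]  # 5 extra steps, spread over three characters

print("extra steps:", sum(tree_a), "vs", sum(tree_b))
print("total fit:  ", total_fit(tree_a), "vs", total_fit(tree_b))
```

In this toy case equal weighting would prefer the second tree (fewer steps), whereas the concave fit prefers the first one, because the additional steps of an already conflicting character count for less.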

Fig. 1 The six strict consensus trees shown by Tschopp et al. The red crosses indicate the OTUs that were pruned from the MPT tree sample to increase the resolution of the SC tree. For the first tree, I added the information on the fraction of missing data (blue dots).

Both tree-like and non-treelike data can collapse strict consensus trees, but the large number of MPTs can be a first indication that the data are not tree-like. The MPT samples inferred by Tschopp et al. are not included in the documentation (following the current standard; see also data uploaded to TreeBase). Using the quick-analysis option in PAUP* (random heuristic search, 100 replicates, CHUCK-options set), I found 3,000 equally parsimonious trees, which are only slightly worse (1983 steps) than the 60,000 MPTs (1979 steps reported) combined in Tschopp et al.’s unweighted cladogram.

Using the consensus network approach (Holland & Moulton 2003) for summarising the parsimony-tree sample (no cut-off value), we can get a first impression of the signal in the matrix (Fig. 2). The data allow for a great number of topological alternatives — they are generally not tree-like. Only a few relationships are unambiguous in this collection. The fan-like topological features (composed typically of low-dimensional boxes) relate to: (a) jumping OTUs (rogue taxa), (b) uncertainty regarding relationships between related OTUs consistently found in the same subtree, and (c) the exact composition of the subtrees. In contrast to the strict consensus tree, the network visualises the tree-unlikeliness of the data expressed in the MPT collection, revealing extremely ‘rogue’-ish OTUs (e.g. Diplodocus_YPM_1922) and OTUs with indiscriminate signal (e.g. FMNH_P25112), and also allows us to qualify the ‘rogueness’ of all other OTUs.
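The difference between a strict consensus tree and a consensus network can be sketched with a few lines of code. The example below is a toy illustration under my own assumptions: each tree is represented simply as a set of non-trivial splits (each split given by the taxa on one side), and the function names and the five taxa are hypothetical. It is not how any particular software package implements the method.

```python
from collections import Counter


def split_frequencies(trees):
    """Fraction of trees in which each split occurs."""
    counts = Counter()
    for splits in trees:
        counts.update(set(splits))
    return {split: n / len(trees) for split, n in counts.items()}


def consensus_splits(trees, threshold=0.0):
    """Splits found in more than `threshold` of the trees (0.0 keeps every split)."""
    return {s for s, f in split_frequencies(trees).items() if f > threshold}


def strict_consensus_splits(trees):
    """Splits found in every tree, i.e. the content of the strict consensus tree."""
    return set.intersection(*(set(t) for t in trees))


def compatible(a, b, taxa):
    """Two splits fit on one tree if at least one of the four side intersections is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


taxa = frozenset("ABCDE")
AB, AC, ABC = frozenset("AB"), frozenset("AC"), frozenset("ABC")
trees = [{AB, ABC}, {AC, ABC}, {AB, ABC}]  # three hypothetical equally optimal trees

print(strict_consensus_splits(trees))  # only the split shared by all trees survives
print(split_frequencies(trees))        # the network keeps AB and AC as alternatives
print(compatible(AB, AC, taxa))        # incompatible splits produce a box in the network
```

The strict consensus collapses the conflict between the two alternatives into a polytomy, while the network keeps both and shows how often each was found.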

Fig. 2 Strict consensus network (all edge-lengths set to 1) of 3000 equally parsimonious trees, inferred from Tschopp et al.'s matrix. This graph is the network equivalent of the commonly seen strict consensus cladograms (Fig. 1). Note that the tree sample is slightly suboptimal and probably not comprehensive.

One pre-inference measure for tree-likeness is the Delta Value (DV) introduced by Holland et al. (2002); see e.g. Auch et al. (2006) and Göker & Grimm (2008) for applications. The matrix DV is 0.47, which is very high, even for a morphological matrix. The individual DVs (iDVs) range between 0.417 and 0.577, which means that no set of OTUs provides a tree-like signal. The complete data are not tree-like, and hence the failure to find unambiguous relationships, even when a comprehensive tree search and ‘implied weighting’ are used (see Tschopp et al. 2015). Extreme iDVs (> 0.55) correlate with (relatively) high proportions of missing data (75–98%, i.e. 10–119 defined characters; Fig. 3), indicating that missing data are a problem for inferences and the calculation of the pairwise distance matrix.
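For readers who have not met Delta Values before, the following sketch shows the basic calculation as I understand it from Holland et al. (2002): for each quartet of taxa, the three possible sums of pairwise distances are sorted, and delta is the difference between the two largest sums divided by the difference between the largest and the smallest. The matrix DV is the mean over all quartets, and an individual DV is the mean over all quartets that include the taxon. The function names, the dictionary layout and the two toy distance matrices are my own assumptions.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """Delta of one quartet: 0 for perfectly tree-like distances, up to 1 for box-like ones."""
    m1, m2, m3 = sorted((d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]))
    if m3 == m1:
        return 0.0  # all three sums equal (star-like); treated as 0 by convention here
    return (m3 - m2) / (m3 - m1)


def delta_values(d):
    """Return (matrix DV, {taxon: individual DV}); needs at least four taxa."""
    per_taxon = {t: [] for t in d}
    all_deltas = []
    for quartet in combinations(sorted(d), 4):
        delta = quartet_delta(d, *quartet)
        all_deltas.append(delta)
        for t in quartet:
            per_taxon[t].append(delta)
    matrix_dv = sum(all_deltas) / len(all_deltas)
    return matrix_dv, {t: sum(v) / len(v) for t, v in per_taxon.items()}


def symmetric(pairs):
    """Build a nested distance dictionary from {(a, b): distance} entries."""
    d = {}
    for (a, b), dist in pairs.items():
        d.setdefault(a, {})[b] = dist
        d.setdefault(b, {})[a] = dist
    return d


# Distances that fit the tree ((A,B),(C,D)) exactly, and distances along a four-taxon cycle.
tree_like = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
                       ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3})
cycle = symmetric({("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1,
                   ("A", "D"): 1, ("A", "C"): 2, ("B", "D"): 2})

print(delta_values(tree_like)[0])
print(delta_values(cycle)[0])
```

The first matrix satisfies the four-point condition and scores 0, the second is maximally box-like and scores 1; real matrices fall somewhere in between.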

Fig. 3 XY-plot showing the individual Delta Values (a measure for treelike signal) in relation to the proportion of missing data. The green "comfort zone" indicates iDVs favorable for tree-inference (based on personal experience).

Subsequent question: Why are the data not tree-like?

In his post, David listed four possible reasons for non-tree-like data:
  (a) uninformative data: a “bush”,
  (b) weakly tree-like data: a “tree obscured by vines”,
  (c) data containing several strongly incompatible relationships: a “structured network”,
  (d) confusing or random data: a “spider-web”.
Lacking branch-lengths, the MPT consensus network above provides no information regarding (a), and limited information regarding (b) and (c). Only (d) can be excluded as a main source of non-tree-like signal for the dinosaur data: higher-than-3-dimensional boxes are rare.

Fig. 4 Bootstrap (BS) consensus network based on 10,000 BS (pseudo)replicates. Trivial splits in grey, splits without strong alternatives in blue, conflicting splits (always two alternatives) in red. Splits found in less than 20% of the BS replicates are not shown, and edge lengths are proportional to the split frequencies.

Figure 4 shows the bootstrap support network based on 10,000 parsimony bootstrap pseudoreplicates (generated following Müller 2005). Some terminal sister relationships seen in the original, taxon-reduced, unweighted or weighted SC trees rely on quite robust, unconflicted signal; a few others are supported by only a small fraction of the characters, but all competing alternatives by even fewer (blue edges in the graph). Thus, it is a “Maybe” for (a) (see also Fig. 5), and a “Yes” for (b) (compare Figs 2 and 4). The character suites of many OTUs provide no robust signal to place them; their position in the set of trees is based on the signal of relatively few characters (given the large matrix), or is the result of branching artefacts as we force non-treelike data into a tree. The robust signal for some terminal clades may be obscured by ambiguous signal of potential additional members of the clade, or OTUs similar to only part of a clade (the “vines”).

We can also observe some pronounced 2-dimensional boxes: here the signal from the data matrix has no preference for a single alternative, but indicates two competing alternatives (red edges in the graph), i.e. also a possible “Yes” for (c). In the case of morphological data, reticulate signals do not necessarily indicate reticulation in an evolutionary sense. They can be triggered by two (more or less related) lineages evolving into the same morphospace, or the co-existence of ancestral and derived forms (see also this post). No spider-web-like portions (high-dimensional boxes) are seen (and are also largely missing from the MPT consensus network in Fig. 2), so we can exclude chaotic signal as reason (d) for the tree-unlikeliness of the data.

Fig. 5 Neighbour-net splits graph based on pairwise (Hamming) distances computed with PAUP* using the Tschopp et al. matrix.

Figure 5 shows the unfiltered, simple (Hamming) distance-based neighbour-net (NNet) for the same matrix. Mirroring the high matrix DV and iDVs, the NNet has only a few tree-like portions, but nevertheless reflects a high diversity (long terminal edges); pairwise distances range between 0 (no difference in data-covered characters) and 1 (all characters are different). Some OTUs are placed close to or in the boxy centre of the graph or the root trunks of terminal groups. Such a placement is either indicative of ancestry (see my earlier post), which is a special case of reason (c), or a lack of discriminative signal, i.e. reason (a) for non-treelike data. Here, it appears to be mostly the latter: the iDVs are high, and the highest iDVs relate to high proportions of missing data (more than 75%).

High proportions of missing data do not necessarily result in high DV (here 75% missing data equals c. 120 defined characters, which could be more than enough to place a taxon). But quite a few OTUs have zero pairwise distances to a set of diverse OTUs that are not closely related. In total, 74 of the 81 OTUs show a zero-distance to at least one other OTU, with Diplodocus YPM 1922 (98% missing data) being the most extremely non-distinct OTU: it has a zero-distance to 66 OTUs, including one outgroup taxon. Such a pattern is impossible from an evolutionary point of view (even an ancestor cannot be identical to all of its offspring once they have diversified), and is a missing data artefact. The NNet resolves this data insufficiency by placing the highly ambiguous OTUs in the centre of the graph, whereas parsimony (or other tree inference) deals with this effectively unsolvable problem by providing some, many, or all theoretically possible placements of the problematic OTU (the OTU turns ‘rogue’) as equally optimal (large fans in Fig. 2) but without support (Fig. 4).
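The zero-distance artefact is easy to reproduce with a toy matrix. The sketch below computes a simple mean character difference over the characters scored in both OTUs, which is my assumption of a reasonable stand-in for the Hamming distances used here (PAUP* may treat missing data and polymorphisms differently). The four OTUs and their scorings are invented.

```python
def mean_character_difference(a, b, missing="?"):
    """Proportion of differing characters among those scored in both OTUs (None if no overlap)."""
    shared = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def zero_distance_partners(matrix):
    """For each OTU, list the OTUs from which it cannot be distinguished."""
    partners = {name: [] for name in matrix}
    names = list(matrix)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mean_character_difference(matrix[a], matrix[b]) == 0:
                partners[a].append(b)
                partners[b].append(a)
    return partners


# Invented example: "Fragment" is scored for a single character only.
matrix = {
    "OTU_1": "0011010",
    "OTU_2": "1100011",
    "OTU_3": "0?1?010",
    "Fragment": "?????1?",
}

print(mean_character_difference(matrix["OTU_1"], matrix["OTU_2"]))
print(zero_distance_partners(matrix)["Fragment"])
```

OTU_1 and OTU_2 differ in five of seven characters, yet the fragment is "identical" to both, which is the kind of pattern described above for the most poorly scored specimens.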

There are two options to infer phylogenetic trees, or to test alternative evolutionary hypotheses using Tschopp et al.’s matrix with its tree-unlike data.
  1. One is to reduce the taxon set to those OTUs with less than 50% of missing data, to produce a backbone tree or network (matrix DV = 0.28; iDVs range from 0.219 to 0.352; Fig. 6). Then evaluate the position (or possible positions) of each other OTU within this backbone (using ‘+1 OTU’ neighbour-nets, parsimony-optimisation or algorithms such as the evolutionary placement algorithm implemented in RAxML; Berger & Stamatakis 2010; Berger, Krompass & Stamatakis 2011). Then finalise with group-restricted taxon and character subsets to study within-group relationships.
  2. The other is to cut the matrix into pieces and taxon sets with good data overlap. Then assess the correlation between these submatrices (e.g. using Pearson’s correlation coefficient) and their tree-likeness (using Delta Values). Then use consensus networks and/or supernetworks to investigate potential incongruences, and to summarise topological alternatives.
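Two of the steps in these options are simple enough to sketch: selecting the backbone OTUs by their proportion of missing data, and comparing the pairwise distances obtained from two submatrices with Pearson’s correlation coefficient. The function names, the "?" coding for missing data, and all values below are assumptions made up for illustration.

```python
import math


def missing_fraction(row, missing="?"):
    """Proportion of characters not scored for one OTU."""
    return row.count(missing) / len(row)


def backbone_taxa(matrix, max_missing=0.5):
    """Keep only OTUs with less than `max_missing` missing data."""
    return {name: row for name, row in matrix.items()
            if missing_fraction(row) < max_missing}


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long lists of distances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented matrix: two well-scored OTUs and one fragmentary specimen.
matrix = {"OTU_1": "0101?011", "OTU_2": "1?010011", "Fragment": "??1???0?"}
print(sorted(backbone_taxa(matrix)))

# Invented pairwise distances for the same OTU pairs from two character partitions.
skull_distances = [0.10, 0.35, 0.40, 0.55]
limb_distances = [0.15, 0.30, 0.45, 0.50]
print(round(pearson(skull_distances, limb_distances), 3))
```

A high correlation would suggest that the partitions carry a similar signal; a low one is a hint to look at them separately before combining them.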

Fig. 6 Neighbour-net (NNet) for a taxon-reduced set, only including OTUs with more than 50% of defined characters. These data result in a single most-parsimonious tree, which is largely congruent with the main splits in the NNet (blue), except for three poorly supported branches (red). Numbers indicate neighbour-joining and parsimony bootstrap support for branches in the MPT and corresponding edges in the NNet and their alternatives.

Palaeontologists: Please stop using strict consensus trees, and start with EDA

To fill the deeper parts of the Tree of Life with life, we cannot get around morphological data and phylogenetic inferences based on these data. Most of Earth’s diversity is extinct, so their molecular data are (largely) lost to science. But no matter whether we work with extinct plants or animals, or with matrices containing many or few morphological characters, we should keep a close eye on the primary signals in those matrices. Are the data tree-like? Are there rogue taxa, and how/why do they affect the inferences? How discriminatory are the data regarding competing alternative hypotheses? Does taxon and character sampling matter? Networks (planar or n-dimensional) can help to: (1) assess the potential of the data for tree inference, and (2) discuss the putative monophyly of groups and their alternatives.

The signal from morphological data matrices is complex, and the data are rarely tree-like. Irrespective of whether one wants to stick with parsimony or not, tree-based and support consensus networks should by now have long replaced the strict (or majority-rule) consensus trees in “cladistic” or general-phylogenetic studies dealing with extinct groups of organisms.

A posteriori methods to filter or down-weight characters not fitting the inferred tree(s) ignore the fact that morphological differentiation typically cannot be explained by a single tree (leaving aside that total-evidence and DNA-constrained analyses demonstrate that morphological evolution is not parsimonious at all). There are too many sources of signal incompatible with the true tree.

In the light of ambiguous and potentially biased signals (outlined and discussed by Tschopp et al. 2015 for their data), the focus of cladistic or other phylogenetic studies that aim to fill the Tree of Life with extinct branches cannot be to infer a clean(ed) tree. Instead, the focus should be on exploring the signals in the data and assessing their capacity to exclude or support evolutionary scenarios. A well understood topological uncertainty is always better than a poorly supported clade.

Regarding the Tree of Life, we should start representing uncertainty as-is (i.e. showing the currently competing alternatives), and reserve polytomies for cases where we really have no idea at all. Also, we should place potential ancestors (ancestral forms) where they belong: at the root nodes of their descendant lineages (the forms derived from them).


Auch AF, Henz SR, Holland BR, Göker M. (2006) Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences. BMC Bioinformatics 7:350.

Berger SA, Krompass D, Stamatakis A. (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under Maximum Likelihood. Systematic Biology 60:291–302.

Berger SA, Stamatakis A. (2010) Accuracy of morphology-based phylogenetic fossil placement under Maximum Likelihood. IEEE/ACS International Conference on Computer Systems and Applications (AICCSA). Hammamet: IEEE. p. 1-9.

Felsenstein J. (2001) The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.

Felsenstein J. (2004) Inferring phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Göker M, Grimm GW. (2008) General functions to transform associate data to host data, and their use in phylogenetic inference from sequences with intra-individual variability. BMC Evolutionary Biology 8:86.

Haeckel E. (1866) Generelle Morphologie der Organismen. Berlin: Georg Reiner.

Hennig W. (1950) Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag.

Holland B, Moulton V. (2003) Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176.

Holland BR, Huber KT, Dress A, Moulton V. (2002) Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19:2051-2059.

Müller KF. (2005) The efficiency of different search strategies for estimating parsimony, jackknife, bootstrap, and Bremer support. BMC Evolutionary Biology 5:58.

Scotland RW, Steel M. (2015) Circumstances in which parsimony but not compatibility will be provably misleading. Systematic Biology 64:492–504. [preprint]

Tschopp E, Mateus O, Benson RBJ. (2015) A specimen-level phylogenetic analysis and taxonomic revision of Diplodocidae (Dinosauria, Sauropoda). PeerJ 3:e857.

Post-script: Why distance-based approaches?

Distance-based approaches may be still refuted by hard-core cladists as “unphylogenetic” or “phenetic” (again, see Felsenstein 2004 for the historical reasons, and why this is wrong), particularly when acting as anonymous reviewers of palaeontological papers. But the simple fact is: a character matrix not allowing inference of a pairwise distance matrix with at least some tree-like signal, should not be used to infer phylogenetic trees (no matter which optimality criterion is used).

A perfect character matrix, i.e. a matrix in which each dichotomy is subsequently followed by one or several strictly synapomorphic changes will, of course, result in a single MPT. But it will also provide a simple (Hamming) mean distance matrix allowing us to infer a neighbour-joining tree fulfilling the least-squares or minimum evolution optimality criteria, and this will be identical to the MPT and a corresponding NNet without any box-like portions. It will also be the most probable topology that can be inferred using maximum likelihood or Bayesian inference.

When different tree inference methods come to substantially different results for morphological matrices, the signal from the primary matrix is likely not to be tree-like, and internal conflict then needs to be explored. The more tree-like the matrix, the less it will be affected by methodological differences (e.g. Fig. 6; the only branches of the MPT not fitting the preferred splits in the NNet have low support, and compete with equally low supported splits seen in the NNet that receive high support from NJ-bootstrapping).

Distance-based analyses are much faster than parsimony, maximum likelihood, and Bayesian inferences; and they are not restricted to inferring phylogenetic trees. Within the same time that I need to perform a comprehensive tree and branch support analysis, I can generate hundreds of NNets using different taxon and character subsets of my matrix, and thus explore its many signals. One can employ different distance measures to deal with continuous or ordered categorical data, and then directly see the effect on the reconstruction. Eventually, one may find a subset that provides the most tree-like signal, which will be the best possible basis for the final tree-inference (in case an evolutionary tree is what is wanted) and branch support analysis.

Tuesday, August 22, 2017

Unattested character states

In an earlier post from January 2016, I argued that it is important to account for directional processes when modeling language history through character-state evolution. In previous papers (List 2016; Chacon and List 2015), I tried to show that this can be easily done with asymmetric step matrices in a parsimony framework. Only later did I realize that this is nothing new for biologists who work on morphological characters, thus supporting David's claim that we should not compare linguistic characters with the genotype, but with the phenotype (Morrison 2014). Early this year, a colleague introduced me to Mk-models in phylogenetics, which were first introduced by Lewis (2001) and allow analysis of multi-state characters in a likelihood framework.
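To illustrate what an asymmetric step matrix does in a parsimony framework, here is a minimal sketch using Sankoff's dynamic-programming approach on a small rooted tree. It is not the implementation used in the papers cited above; the tree, the two sounds "p" and "f", and the costs (a change from p to f is cheap, the reverse is expensive) are invented for the example.

```python
INF = float("inf")


def sankoff(tree, tip_states, states, cost):
    """Return {state: minimal cost of the subtree if its root has that state}.

    `tree` is a tip name or a tuple of subtrees; cost[s][t] is the cost of
    changing from parent state s to child state t.
    """
    if isinstance(tree, str):
        return {s: 0 if s == tip_states[tree] else INF for s in states}
    children = [sankoff(sub, tip_states, states, cost) for sub in tree]
    return {s: sum(min(cost[s][t] + child[t] for t in states) for child in children)
            for s in states}


states = ["p", "f"]
tree = (("L1", "L2"), ("L3", "L4"))                      # hypothetical language tree
tip_states = {"L1": "p", "L2": "f", "L3": "f", "L4": "f"}

symmetric = {"p": {"p": 0, "f": 1}, "f": {"p": 1, "f": 0}}
asymmetric = {"p": {"p": 0, "f": 1}, "f": {"p": 5, "f": 0}}  # f > p is penalised

print(sankoff(tree, tip_states, states, symmetric))
print(sankoff(tree, tip_states, states, asymmetric))
```

With symmetric costs the majority sound "f" is the cheapest root state; once the reverse change is penalised, the rarer "p" becomes the preferred ancestral sound, which is the kind of directional argument historical linguists make by hand.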

What surprised me is that Mk-models seem to outperform parsimony frameworks, although they are much simpler than the elaborate step-matrices defined for morphological characters (Wright and Hillis 2014). Today, I read that a recent paper by Wright et al. (2016) even shows how asymmetric transition rates can be handled in likelihood frameworks.
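The simplicity of the Mk-model can be seen from its transition probabilities. The sketch below gives the standard closed form for k states with equal rates between all states, assuming that the branch length t is measured in expected changes per character; the function name and the example values are mine, and the sketch covers only the symmetric model, not the asymmetric extensions.

```python
import math


def mk_probability(i, j, k, t):
    """Probability of ending in state j after branch length t when starting in state i.

    k is the number of states (at least 2); all changes between states are equally likely.
    """
    decay = math.exp(-k * t / (k - 1))
    if i == j:
        return 1 / k + (k - 1) / k * decay
    return 1 / k - decay / k


k = 4
for t in (0.0, 0.1, 1.0, 10.0):
    stay = mk_probability(0, 0, k, t)
    change = mk_probability(0, 1, k, t)
    print(t, round(stay, 4), round(change, 4))
```

On a branch of length zero nothing changes, and on very long branches every state becomes equally probable (1/k), regardless of the starting state.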

Being by no means an expert in phylogenetic analyses, especially not in likelihood frameworks, I tend to have a hard time understanding what is actually being modeled. However, if I correctly understand the gist of the Wright et al. paper, it seems that we are slowly approaching a situation in which more complex scenarios of lexical character evolution in linguistics no longer need to rely on parsimony frameworks.

But, unfortunately, we are not there yet; and it is even questionable whether we will ever be. The reason is that all multi-state models that have been proposed so far only handle transitions between attested characters: unattested characters can neither be included in the analyses nor can they be inferred.

I have pointed to this problem in some previous blogposts, the last one published in June, where I mentioned Ferdinand de Saussure (1857-1913), who postulated two unattested consonantal sounds for Indo-European (Saussure 1879), of which one was later found to have still survived in Hittite, a language that was deciphered and shown to be Indo-European only about 30 years later (Lehmann 1992: 33).

The fact that it is possible to use our traditional methods to infer unattested sounds from circumstantial evidence, but not to include our knowledge about them into phylogenetic analyses, is a huge drawback. Potentially even greater are the situations where even our traditional methods do not allow us to infer unattested data. Think, for example, of a word that was once present in some language but was later completely lost. Given the ephemeral nature of human language, we have no way to know this, but we know very well that it easily happens when just thinking of some terms used for old technology, like walkman or soon even iPod, which the younger generations have never heard about.

Colleagues with whom I have discussed my concerns in this regard are often more optimistic than I am, saying that even if the methods cannot handle unattested characters they could still find the major signal, and thus tell us at least the general tendency as to how a language family evolved. However, for classical linguists, who can infer quite a lot using the laborious methods that still need to be applied manually, it leaves a sour taste if they are told that the analysis deliberately ignored crucial aspects of the processes and phenomena they understand very well. For example, if we detect that some intelligence test is right in about 80% of all cases, we would also abstain from using it to judge who we allow to take up their studies at university.

I also think that it is not a satisfying solution for the analysis of morphological data in biology. It is quite likely that some ancient species had certain traits, which later evolved into the traits we observe, that are simply no longer attested anywhere, either in fossils or in the genes. I also wonder how well phylogenetic frameworks generally account for the fact that the evidence we are left with may reflect much less than what was once there.

In Chacon and List (2015), we circumvent the problem by adding ancestral but unattested sounds to the step matrices in our parsimony analysis. This is of course not entirely satisfactory, as it adds a heavy bias to the analysis of sound change, which no longer tests for all possible solutions but only for the ones we fed into the algorithm. For sound change, it may be possible to substantially expand the character space by adding sounds attested across the world's languages, and then having the algorithms select the most probable transitions. But given that we still barely know anything about general transition probabilities of sound change, and that databases like Phoible (Moran et al. 2014) list more than 2,000 different sounds for a bit more than 2,000 languages, it seems like a Sisyphean challenge to tackle this problem consistently.

What can we do in the meantime? Not very much, it seems. But we can still try to improve our methods in baby steps, trying to get a better understanding of the major and minor processes in linguistic and biological evolution; and not forgetting that, although I was only talking about phylogenetic tree reconstruction, in the end we also want to have all of this done in network approaches.

  • Chacon, T. and J.-M. List (2015) Improved computational models of sound change shed light on the history of the Tukanoan languages. Journal of Language Relationship 13: 177-204.
  • Lehmann, W. (1992) Historical linguistics. An Introduction. Routledge: London.
  • Lewis, P. (2001) A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology 50: 913-925.
  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.
  • Moran, S., D. McCloy, and R. Wright (eds) (2014) PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.
  • Morrison, D.A. (2014) Are phylogenetic patterns the same in anthropology and biology? bioRxiv.
  • Saussure, F. (1879) Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.
  • Wright, A. and D. Hillis (2014) Bayesian analysis using a simple likelihood model outperforms parsimony for estimation of phylogeny from discrete morphological data. PLoS ONE 9.10. e109210.
  • Wright, A., G. Lloyd, and D. Hillis (2016) Modeling character change heterogeneity in phylogenetic analyses of morphology through the use of priors. Systematic Biology 65: 602-611.

Tuesday, August 15, 2017

Is reticulation as important in rice as in wheat?

I have previously discussed the use of phylogenetic networks to study the Complex hybridizations in wheat, due to the very reticulate evolutionary history. It seems that the situation for the other major world food source, rice, also requires network analysis, although this time introgression is the biological source of reticulation, rather than hybridization.

Jae Young Choi, Adrian E. Platts, Dorian Q. Fuller, Yue-Ie Hsing, Rod A. Wing, and Michael D. Purugganan (2017) The rice paradox: multiple origins but single domestication in Asian rice. Molecular Biology & Evolution 34: 969-979.

The authors note:
The Asian rice Oryza sativa is the world’s most important food crop, and is a staple for more than one-third of the world’s population. Oryza sativa is genetically differentiated into several groups, the main ones being japonica and indica, which have been considered as subspecies / subpopulations with distinct morphological and physiological characteristics

The origin of domesticated Asian rice has been a contentious topic, with conflicting evidence for either single or multiple domestication of this key crop species. We examined the evolutionary history of domesticated rice by analyzing de novo assembled genomes from domesticated rice and its wild progenitors. Our results indicate multiple origins, where each domesticated rice subpopulation (japonica, indica, and aus) arose separately from progenitor O. rufipogon and / or O. nivara.

We also show that there is significant gene flow from japonica to both indica (c. 17%) and aus (c. 15%), which led to the transfer of domestication alleles from early-domesticated japonica to proto-indica and proto-aus populations. Our results provide support for a model in which different rice subspecies had separate origins, but that de novo domestication occurred only once, in O. sativa ssp. japonica, and introgressive hybridization from early japonica to proto-indica and proto-aus led to domesticated indica and aus rice.
Similar reticulation histories have, of course, been reported for most domesticated organisms (see Are phylogenetic trees useful for domesticated organisms?), including dogs, cattle, horses, sheep, grapes, etc.

Tuesday, August 8, 2017

Where to retire - a network analysis

I am an elderly man, and it is getting towards time to retire. But where?

I could retire back in Australia; but, as Thomas Wolfe said: "You can't go home again." I could retire in Sweden, but the tax authorities are likely to then take 25% of my pension, which I need to be living on, instead. So, where to go?

This is a question that has occupied the minds of many people, for themselves as well as others; and so, inevitably, you will find web sites on the matter. For example, Live and Invest Overseas has a Retire Overseas Index, recommending particular places, which it updates annually; and International Living has a similar Annual Global Retirement Index.

To help me in my decision, let's look at the International Living data, The World’s Best Places to Retire in 2017. This site provides a rating (out of 100) of ten important characteristics, for 24 countries that might be of interest to retirees:
  • Benefits & discounts
  • Buying & renting
  • Climate
  • Cost of living
  • Entertainment & amenities
  • Fitting in
  • Health care
  • Healthy lifestyle
  • Infrastructure
  • Visas & residence
For 2017, the individual scores vary from 57-100, with "Benefits & discounts" and "Cost of living" varying the most between countries, and "Fitting in" and "Health care" varying the least.

The ten scores for each country can be averaged, to provide a rank ordering of the 24 countries. These average scores vary from 73.3 to 90.9, as shown in the first graph.

There is little to choose between the first three countries in terms of their average score (Ecuador, Mexico, Panama), nor between the next three (Colombia, Costa Rica, Malaysia). But this does not make these countries intrinsically equal. After all, both Panama and Ecuador handsomely outdo Mexico on "Benefits & discounts", while Mexico does better on "Cost of living". I need an analysis that takes into account which characteristics differ between the countries.

This is where a network analysis comes in handy, as a tool for exploratory data analysis. As usual in this blog, I have calculated the Manhattan distance pairwise between the countries; and I am displaying this in the next figure using a NeighborNet network. Countries that have similar retirement characteristics are near each other in the network; and the further apart they are in the network then the more different are their characteristics.
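To see why the average alone is not enough, here is a minimal Python sketch of the pairwise Manhattan distance calculation. The three countries and their ratings are made-up placeholders (not the International Living values), reduced to three categories for brevity; `ratings`, `manhattan` and `pairwise_distances` are hypothetical names. All three placeholder countries have the same average (85), yet their distances differ.

```python
from itertools import combinations

# Placeholder ratings (NOT the published values), three categories only.
# Every country averages 85, so the rank ordering cannot separate them.
ratings = {
    "Country A": [100, 70, 85],
    "Country B": [70, 100, 85],
    "Country C": [84, 86, 85],
}

def manhattan(u, v):
    """Sum of absolute differences across categories."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pairwise_distances(table):
    """Map each unordered pair of countries to its Manhattan distance."""
    return {
        (x, y): manhattan(table[x], table[y])
        for x, y in combinations(table, 2)
    }

distances = pairwise_distances(ratings)
for pair, d in distances.items():
    print(pair, d)
```

A distance matrix of this kind is what the NeighborNet algorithm takes as input; the network itself is computed by dedicated software and is not reproduced here.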

The countries are color-coded by geography, which shows that their actual location has little effect on the Retirement Index. However, the European countries are gathered at the bottom-left, without any representative from Asia. The six top-ranked countries are all clustered in the bottom-right of the network.

Next to this top-rank cluster come Portugal and Spain on one hand, and Nicaragua on the other. These three countries have similar Retirement Scores, but they are separated in the network because Nicaragua scores poorly on "Infrastructure" and "Health care", but better than Europe on "Cost of living", "Buying & renting" and "Healthy lifestyle".

Spain does better than Portugal on "Entertainment & amenities"!

All in all, Portugal looks like a good bet to me. The Live and Invest Overseas site lists individual places to retire, not just countries, and for the past three years it has recommended the Algarve region in Portugal as the top location.

Importantly, the Portuguese also won't tax my pension (Pension i Portugal ger skattefria miljoner), although the Swedish government is not happy about this, of course (Skattefrihet ska stoppas: Portugal till förhandlingsbordet).

Tuesday, August 1, 2017

Stacking neighbour-nets: a real-world example

In my last post, I outlined two ideas about how stacking neighbour-nets can assist in tracing evolutionary change over time, using a theoretical example. In this post, I will show how this could work using a (tricky) real-world example: a morphological matrix including a high proportion of fossil taxa and a good deal of (strongly) homoplasious characters (Bomfleur, Grimm & McLoughlin 2017).

Stacking can be valuable when both fossil and extant taxa are included in the study. The idea of stacking is to construct networks for each time slice, rather than creating one giant network that tries to encompass everything. Adjacent time-slice networks can then be directly compared, which should reveal the evolutionary changes that occurred between those two times. The final phylogeny can then be constructed from this information, including all of the extant taxa and fossils together.

I regard our work as quite innovative for a palaeobotanical/-phylogenetic systematic study, as it generated a taxon-dense dataset down to species (sometimes individual specimens) as ‘operational taxonomic units’ (OTUs). Our goal was to provide a unifying classification for extant and fossil Osmundales (royal ferns) rhizomes. The primary purpose is hence not to infer a phylogenetic tree but to assist in describing and placing new-found rhizome fossils in the phylogeny. The placement workflow (see this tutorial) combines a polytomous key (using conserved, lineage-diagnostic traits) with neighbour-nets that use different taxon sets. We discussed odd placements in the splits graphs, and matrix signal quality (robustness) from differential branch support, as estimated by non-parametric bootstrapping (least-squares, maximum likelihood, maximum parsimony).

Sources of incompatible data patterns in real-world data

The main problem with real-world data when it comes to inferring phylogenetic relationships, i.e. estimating the true phylogeny, is incompatible data patterns. For molecular matrices, the two main sources of signals that will be incompatible with the true phylogeny are back-mutations and model-bias. For instance, there is usually a higher probability for transitions than for transversions; and for coding gene regions, the 3rd codon position can become over-saturated and thus stochastically distributed, providing little phylogenetic signal. By adapting the model in a probabilistic environment, we can (try to) counter such biases during inference.

In the case of morphological (or other non-molecular) traits, incompatible signals arise from:
  1. homoplasious characters – traits that evolve convergently or in parallel, which are frequently included in such matrices;
  2. epigenetic effects – morphological traits not, or not fully, controlled by the genetic composition of the organism; and
  3. pseudo-homologies – traits that are seemingly the same but are the endpoint of different evolutionary pathways.
Inferring a tree reflecting the true phylogeny from such a matrix may be very difficult or even impossible. For a perfect probabilistic approach, we would need to establish character-wise probabilities for change, which requires that a lineage has a modern-day diversity fairly matching that in the past.

Fossils add further sources of signals incompatible with the true phylogeny, such as: preservation artefacts and misinterpretations (false homologies); uncertainty linked to heterochrony; and, last but not least, ‘temporal’ convergences, i.e. the parallel or convergent evolution of the same (or similar) trait in an ancient sister or unrelated lineage of a modern (or much younger) lineage.

For all of these aspects, the royal fern rhizomes provide a nice example (i.e. a bad-case scenario). Only a few of the 45 scored traits that can be observed in fossil material are conserved within the modern lineages and their extant representatives, and hence are of high diagnostic value for assigning fossils to one of these lineages. Many other rhizome features are variable within extant members of the now six genera (some even within a species), and increasingly so looking back into the past.

The royal ferns became arborescent several times, as reflected by convergent adaptations in rhizome anatomy — highly complex stele architectures are found from the Permian onwards in (morpho)species that differ in all relatively stable, lineage-diagnostic traits. The most complex modern-day rhizomes have anatomies that appear to be less derived than those of some of their ancient counterparts. Nonetheless, the rhizomes, scored for 129/130 OTUs (fossil species, partly referring to individual specimens) in our matrix (click here for an annotated version for use with Mesquite), reflect a substantial past diversity and cover more than 250 million years of evolution.

Basic data situation

The all-inclusive neighbour-net (Fig. 1; see here for a fully annotated version) captures aspects of similarity patterns related to phylogenetic relationships, but does not clearly resolve the known (modern) or putative (extinct) genera within the core group Osmundoideae, for example. Overall branch-support is generally low for any alternative (details can be found here), independent of the optimality criterion used. [For our systematic treatment, we used data subsets to generate a series of networks including only members of the same (putative) lineage, which were increasingly proficient at sorting the OTUs.]

The main problems are: (i) the differentiation between less-derived rhizome anatomies of the Osmundoideae found in the likely paraphyletic extinct genus Millerocaulis (pink in Fig. 1) and the modern genus Claytosmunda (magenta, paraphyletic with one survivor); and (ii) the distinctness and superficial similarity of two arborescent lineages, the genus Osmundacaulis (red) and the extinct (Permian to Jurassic) family Guaireaceae (greenish). They differ in all stable, lineage-diagnostic characters but share highly dissected steles. Phylogenetic trees "resolve" this conflict by creating an artificial clade (e.g. the parsimony cladogram by Wang et al. 2014). The neighbour-net (Fig. 1) places Osmundacaulis between the Guaireaceae and the Osmundoideae, the subfamily of Osmundaceae including the surviving modern genera.

Fig. 1. Neighbour-net based on a morphological distance matrix of 122 OTUs representing Permian to extant Osmundales and their putative relatives, the Grammatopteridales (black).

Stacking procedure one: identifying closest relatives in subsequent time-slices

Signal ambiguity (from homoplastic characters and the related resolution issue) also affects the time-wise networks to some degree. Figures 2–4 show the network-per-time-slice stacks. Each neighbour-net includes only the OTUs from one stratigraphic period (Permian, Triassic, Jurassic, Cretaceous, Paleogene + Neogene) and the modern-day survivors. For simplicity, links are only established for the closest potential relative in the subsequent or preceding time-slice; and only shown when the mean morphological distance (MD) does not exceed 0.25. The colouring of the dots reflects the systematic affinity of the taxon as established by Bomfleur et al. and shown in Fig. 1.
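The linking rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow used for the figures: the OTU names and distances are invented, and `slices`, `dist` and `closest_links` are hypothetical names. Each OTU is linked to its nearest OTU in the adjacent time-slice, and the link is kept only if the distance does not exceed the 0.25 cut-off.

```python
# Hypothetical inputs: OTU names per time-slice and a symmetric distance lookup
slices = {
    "Triassic": ["T1", "T2"],
    "Jurassic": ["J1", "J2", "J3"],
}
dist = {
    frozenset(("T1", "J1")): 0.10, frozenset(("T1", "J2")): 0.30,
    frozenset(("T1", "J3")): 0.40, frozenset(("T2", "J1")): 0.35,
    frozenset(("T2", "J2")): 0.20, frozenset(("T2", "J3")): 0.45,
}

def closest_links(older, younger, dist, max_md=0.25):
    """Link each OTU to its nearest OTU in the adjacent slice, if close enough."""
    links = {}
    # Look forward (older -> younger) and backward (younger -> older)
    for source, targets in ((older, younger), (younger, older)):
        for otu in source:
            best = min(targets, key=lambda t: dist[frozenset((otu, t))])
            md = dist[frozenset((otu, best))]
            if md <= max_md:
                links[frozenset((otu, best))] = md
    return links

links = closest_links(slices["Triassic"], slices["Jurassic"], dist)
for pair, md in links.items():
    print(sorted(pair), md)
```

In this toy input, J3 has no neighbour within the cut-off and therefore stays unlinked, which mirrors the situation of newcomers with little relation to the preceding time-slice.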

A major taxonomic turnover characterises the transition from the (late) Permian to the Triassic (Fig. 2). The most primitive (rhizome-wise) Osmundales, the Thamnopterioideae (brown) become extinct, and are completely replaced by the Osmundoideae, their modern counterparts. The only representative of the Permian diversity remaining in the Triassic appears to be Millerocaulis (?Palaeosmunda) stipabonnetiorum, and this may provide a good taxon for rooting the Triassic phylogeny. However, it is also one of the worst-preserved and most poorly described taxa; to some degree, its similarity with both lineages of Permian Osmundaceae (Thamnopterioideae and Palaeosmunda) may hint that the distances are under-estimated, since traits could not be scored that would otherwise lead to increased distances.

Fig. 2. Taxon-reduced neighbour-nets, including only species from the same time-slice (as labelled). Inter-time-slice links indicate the morphologically closest match in the preceding or following time-slice for each species (for pairwise distances < 0.25).

The Jurassic graph (in Fig. 2) highlights a decrease in overall diversity, despite the much higher numbers of OTUs. The links can help to establish relationships between congeners of both time-slices; but for Osmundastrum (today represented by a single, genetically and morphologically derived species) a more pronounced evolutionary shift is indicated: the Triassic putative member is linked to Jurassic Millerocaulis species (a paraphyletic Osmundoideae genus defined by the absence of a trait found in all extant genera), which are relatively close to the first unambiguous Osmundastrum. We also find that the three Jurassic newcomers have little relation to the Triassic basis (Fig. 2).

The linking of the Jurassic and Cretaceous time-slices (Fig. 3) highlights a general weakness of the approach using this matrix: more poorly preserved, incompletely described fossils included in the matrix (Cretaceous Millerocaulis) attract most links from the Jurassic Osmundoideae, because their distances are under-estimated.

Fig. 3. As above, but linking the Jurassic and subsequent Cretaceous neighbour-nets. Note the decreasing diversity but clear signals for Osmundacaulis (red) in contrast to the group of modern Osmundoideae (purplish). Plenasium (light blue) is a modern arborescent genus with complex and highly dissected steles and generally derived rhizomes.
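Why incompletely scored fossils attract links can be shown with a toy calculation. The sketch below assumes that the mean morphological distance is the share of differing states among the characters scored in both OTUs; that definition, the character strings and the name `mean_distance` are assumptions made for illustration and are not taken from the matrix.

```python
MISSING = "?"

def mean_distance(a, b):
    """Share of differing states among characters scored in both OTUs."""
    scored = [(x, y) for x, y in zip(a, b) if MISSING not in (x, y)]
    if not scored:
        return None  # no comparable characters at all
    return sum(x != y for x, y in scored) / len(scored)

complete = "0110"   # hypothetical, fully scored OTU
relative = "0101"   # hypothetical OTU from the adjacent time-slice
fragment = "01??"   # same states as 'complete', but two characters not preserved

d_complete = mean_distance(complete, relative)
d_fragment = mean_distance(fragment, relative)
print(d_complete, d_fragment)
```

When the unpreserved characters happen to be the ones that differ, the fragment appears identical to its neighbour although the complete specimen would not. A cut-off on the distance alone cannot tell these two cases apart.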

The two Osmundastrum, which are probably part of the same evolutionary lineage, are not linked (see Bomfleur, Grimm & McLoughlin 2015 for the reasons). Two modern lineages with more or strongly derived rhizomes appear in the Cretaceous, the Todinae and Plenasium.

In the case of the Todinae the Jurassic links are partly ambiguous, with one Cretaceous OTU linked to Jurassic Claytosmunda (part of the Todinae’s sister clade according to molecular data), but the other with some relatively distinct Millerocaulis. The problem here is that the Todinae may have diverged earlier (Bomfleur, Grimm & McLoughlin 2015; Grimm et al. 2015), but their rhizome fossils have so far not been found (or lack the diagnostic characters of the lineage). Gaps in the fossil record can hinder establishing meaningful links. The links are, however, to a group of Millerocaulis that are closer to coeval Claytosmunda – which show a rhizome anatomy that may be closest to that of the common ancestor of all modern-day royal ferns – than to their congeners. In the case of Plenasium, the genus with the most-derived rhizomes of all modern Osmundaceae, the closest older relative is part of the same subgroup of Millerocaulis. These potentially false links may reflect that some Millerocaulis show derived character suites, which are typically found also in one or another modern Osmundaceae genus (similarity due to convergence).

The closer we get to the modern-day situation, the more interpretable the links become (Fig. 4). Lineages with distinct and derived rhizome anatomies such as Osmundastrum and Plenasium are linked across time-slices. Cross-generic links from Cretaceous Millerocaulis to Paleogene-Neogene Osmunda to modern-day Claytosmunda relate directly to higher numbers of shared, possibly primitive characters in the connected taxa; these links can again be informative for rooting the graphs. Substantially weaker links (mean morphological distances > 0.1 between time-slices) are found for distantly related pairings (Cretaceous and extant Todinae with Paleogene-Neogene Osmundastrum and Claytosmunda).

Fig. 4. As above, but for Cretaceous to modern-day.

Stacking procedure two: graphs including taxa of two subsequent time-slices

Figures 5 and 6 show the two-adjacent-time-slices-per-graph stacks. Interpretation of these figures is more straightforward: one just compares the placement of the connecting taxa (Triassic and Jurassic in Fig. 5; Paleogene and Neogene in Fig. 6). The resolution issue regarding the relationship between Millerocaulis and genera representing the modern lineage (Claytosmunda, Osmundastrum, Plenasium, Leptopteris, Todea) is obvious: the Triassic Millerocaulis are clustered in the Permo-Triassic graph, but are placed apart within the spider-web-like portion in the Triassic-Jurassic graph (Fig. 5). This could mean that several lineages of Millerocaulis diversified in the Jurassic, all of which have their roots in the Triassic. Some of the emerging Millerocaulis groups remain coherent in the Jurassic-Cretaceous graph (and can include Cretaceous species), but their positions relative to each other can change. In contrast, for Osmundacaulis the Cretaceous newcomers simply fit into the existing organisation.

Fig. 5. Stack of neighbour-nets comprising species of two subsequent time-slices, covering the time from the Permian to the Cretaceous. Connections relate to Triassic (lower half) or Jurassic (upper half) species that are included in two subsequent splits graphs.
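The bookkeeping behind this second procedure is simple: pool the OTUs of each pair of adjacent time-slices into one taxon set per graph, then note which OTUs occur in two consecutive graphs. A minimal Python sketch with invented OTU names follows; `slices`, `two_slice_taxon_sets` and `connecting_taxa` are hypothetical names, and the neighbour-nets themselves would still be inferred separately for each taxon set.

```python
# Hypothetical OTU lists per time-slice, oldest first
slices = [
    ("Permian", ["P1", "P2"]),
    ("Triassic", ["T1", "T2"]),
    ("Jurassic", ["J1"]),
    ("Cretaceous", ["K1", "K2"]),
]

def two_slice_taxon_sets(slices):
    """Pool each pair of adjacent time-slices into one taxon set per graph."""
    return [
        (f"{old}-{young}", old_otus + young_otus)
        for (old, old_otus), (young, young_otus) in zip(slices, slices[1:])
    ]

def connecting_taxa(graphs):
    """OTUs shared by consecutive graphs; these tie the stack together."""
    return [
        sorted(set(a) & set(b))
        for (_, a), (_, b) in zip(graphs, graphs[1:])
    ]

graphs = two_slice_taxon_sets(slices)
shared = connecting_taxa(graphs)
for (name, otus) in graphs:
    print(name, otus)
print(shared)
```

With the toy input, the Triassic OTUs connect the first two graphs and the Jurassic OTU connects the second and third, matching the layout described in the caption above.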

The transition from the Cretaceous to the modern-day situation (Fig. 6) fairly reflects what could be inferred by mapping morphological characters onto the molecular tree. The placement of Osmunda species in the graphs reflects evolutionary change towards the modern-day species, whereas stasis can be assumed for Osmundastrum, and a loss of diversity for Claytosmunda. According to the structures of the graphs, the modern-day Plenasium (subgenus Plenasium) replaced the more diverse (and partly more derived) Cretaceous-Paleogene Plenasium (subgenus Aurealcaulis); but the genus is absent from the Neogene, so there are no connections between the ‘65–5 Ma’ and ‘last 25 Ma’ graphs.

Fig. 6. As above, but covering the time from the Cretaceous to now. Connections refer to Paleogene (lower half) and Neogene (upper half) species.

Now that it’s done, what can be said?

Establishing similarity links across time-slices can be tedious or even misleading, especially with increasing numbers of taxa and increasing complexity of the signals in the matrix (Figs 2–3). The process is more time-consuming and the result (Figs 2–4) is graphically more challenging than the alternative stacking procedure (Figs 5–6).

With most real-world data, it may be difficult to get a set of links between time slices that reflects the true phylogeny, as was possible in my earlier theoretical example. Nonetheless, the procedure can help to identify potential relatives (ancestors, descendants, sister lineages) of groups that are restricted to a single time slice, or highlight the lack of potential or favourable candidates.

However, in general, joining the taxa from two subsequent time-slices in one graph, and connecting these graphs by the shared taxa, seems to be a more feasible and straightforward approach. Once a matrix is compiled, the distance calculation and splits-graph inference are a matter of minutes, and it takes less than half-an-hour to produce a first graphical output using the graphical functions in SplitsTree and software to graphically stack the exported SVG or EPS files (further beautification may take a day). Taxa with odd signals (with ambiguous affinity) will be placed accordingly in the nets and eventually move around in the two containing graphs (Fig. 5), and the amount of evolutionary change across time may be directly visible (Fig. 6).
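For readers who compute their distances outside SplitsTree, the matrix has to be handed over as a NEXUS file. The following Python sketch writes a taxa block and a distances block. The function name `nexus_distances` is hypothetical, and the FORMAT line (`labels=left diagonal triangle=both`) reflects my understanding of the distances block that SplitsTree reads; it should be checked against the program's manual before use.

```python
def nexus_distances(labels, matrix):
    """Format a symmetric distance matrix as NEXUS taxa + distances blocks.

    Labels are written as-is, so they must not contain spaces.
    """
    n = len(labels)
    lines = [
        "#NEXUS",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={n};",
        "  TAXLABELS " + " ".join(labels) + ";",
        "END;",
        "BEGIN distances;",
        f"  DIMENSIONS ntax={n};",
        # Assumed FORMAT tokens; verify against the SplitsTree manual
        "  FORMAT labels=left diagonal triangle=both;",
        "  MATRIX",
    ]
    for label, row in zip(labels, matrix):
        lines.append("  " + label + " " + " ".join(f"{d:.4f}" for d in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines)

# Toy example with two hypothetical OTUs
text = nexus_distances(["OTU_A", "OTU_B"], [[0.0, 0.5], [0.5, 0.0]])
print(text)
```

Writing one such file per taxon set (one per graph in the stack) keeps the per-graph analyses independent and easy to repeat.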

Additional links for readers interested in details

Figure illustrating the history of taxonomic systems for Osmundales.
— An archive including all analysis files generated in the course of the original study is hosted at the Dryad Digital Repository.
— Further annotated versions of the figures shown in this post and the used analysis files have been published under a CC-BY licence: Grimm G. (2017) Osmundales diverstity through time: stacking networks. figshare. https://doi.org/10.6084/m9.figshare.5255014.v1.


Bomfleur B, Grimm GW, McLoughlin S (2015) Osmunda pulchella sp. nov. from the Jurassic of Sweden—reconciling molecular and fossil evidence in the phylogeny of modern royal ferns (Osmundaceae). BMC Evolutionary Biology 15: 126.

Bomfleur B, Grimm GW, McLoughlin S (2017) The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5: e3433.

Grimm GW, Kapli P, Bomfleur B, McLoughlin S, Renner SS (2015) Using more than the oldest fossils: Dating Osmundaceae with the fossilized birth-death process. Systematic Biology 64: 396-405.

Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology & Evolution 23: 254-267.

Maddison WP, Maddison DR (2001 onwards) Mesquite: a modular system for evolutionary analysis.

Wang S-J, Hilton J, He X-Y, Seyfullah LJ, Shao L (2014) The anatomically preserved Zhongmingella gen. nov. from the Upper Permian of China: evaluating the early evolution and phylogeny of the Osmundales. Journal of Systematic Palaeontology 1: 1-22.