Tuesday, February 28, 2017

Models and processes in phylogenetic reconstruction

Since I started interdisciplinary work (linguistics and phylogenetics), I have repeatedly heard the expression "model-based". This expression often occurs in the context of parsimony vs. maximum likelihood and Bayesian inference, and it is usually embedded in statements like "the advantage of ML is that it is model-based", or "but parsimony is not model-based". By now I assume that I get the gist of these sentences, but I am afraid that I often still do not get their point. The problem is the ambiguity of the word "model", in biology as well as in linguistics.

What is a model? For me, a model is usually a formal way to describe a process that we deal with in our respective sciences, nothing more. If we talk about the phenomenon of lexical borrowing, for example, there are many distinct processes by which borrowing can happen.

A clear-cut case is Chinese kāfēi 咖啡 "coffee". This word was obviously borrowed from some Western language not that long ago. I do not know the exact details (which would require a rather lengthy literature review and an inspection of older sources), but that the word is not too old in Chinese is obvious. The fact that the pronunciation comes close to the word for coffee in the largest European languages (French, English, German) is a further hint, since the longer a new word has survived after having been transplanted to another language, the more it resembles other words in that language regarding its phonological structure; and the syllable does not occur in other words in Chinese. We can depict the process with the help of the following visualization:

Lexical borrowing: direct transfer
The visualization tells us a lot about a very rough and very basic idea as to how the borrowing of words proceeds in linguistics: Each word has a form and a function, and direct borrowing, as we could call this specific subprocess, proceeds by transferring both the form and the function from the donor language to the target language. This is a very specific type of borrowing, and many borrowing processes do not directly follow this pattern.

In the Chinese word xǐnǎo 洗脑 "brain-wash", for example, the form (the pronunciation) has not been transferred. But if we look at the morphological structure of xǐnǎo, a compound consisting of the verb xǐ "to wash" and nǎo "the brain", it is clear that here Chinese borrowed only the meaning. We can visualize this as follows:
Lexical borrowing: meaning transfer

Unfortunately, I am already starting to simplify here. Chinese did not simply borrow the meaning, but it borrowed the expression, that is, the motivation to express this specific meaning in an analogous way to the expression in English. However, when borrowing meanings instead of full words, it is by no means straightforward to assume that the speakers will borrow exactly the same structure of expression they find in the donor language. The German equivalent of skyscraper, for example, is Wolkenkratzer, which literally translates as "cloudscraper".

In any language of the world, there are many different ways to coin a good equivalent for "brain-wash" that are not analogous to the English expression. One could, for example, also call it "head-wash", "empty-head", "turn-head", or "screw-mind"; and the only reason we call it "brain-wash" (instead of these others) is that this word was chosen at some time when people felt the need to express this specific meaning, and the expression turned out to be successful (for whatever reason).

Thus, instead of just distinguishing between "form transfer" and "meaning transfer", as my above visualizations suggest, we can easily find many more fine-grained ways to describe the processes of lexical borrowing in language evolution. Long ago, I took the time to visualize the different types of borrowing processes mentioned in the work of Weinreich (1953[1974]) in the following graphic:

Lexical borrowing: hierarchy following Weinreich (1953[1974])

From my colleagues in biology, I know well that we find a similar situation in bacterial evolution with different types of lateral gene transfer (Nelson-Sathi et al. 2013). We are not even sure whether the account by Weinreich as displayed in the graphic is actually exhaustive; and the same holds for evolutionary biology and bacterial evolution.

But it may be time to get back to the models at this point, as I assume that some of you who have read this far have begun to wonder why I am spending so many words and graphics on borrowing processes when I promised to talk about models. The reason is that in my usage of the term "model" in scientific contexts, I usually have in mind exactly what I have described above. For me (and I suppose not only for me, but for many linguists, biologists, and scientists in general), models are attempts to formalize processes by classifying and distinguishing them, and flow-charts, typologies, descriptions, and the identification of distinctions are an informal way to communicate them.

If we use the term "model" in this broad sense, and look back at the discussion about parsimony, maximum likelihood, and Bayesian inference, it also becomes clear that it does not make immediate sense to say that parsimony lacks a model, while the other approaches are model-based. I understand why one may want to make this strong distinction between parsimony and methods based on likelihood-thinking, but I do not understand why the term "model" needs to be employed in this context.

Nearly all recent phylogenetic analyses in linguistics use binary characters and describe their evolution with the help of simple birth-death processes. The only difference between parsimony and likelihood-based methods is how the birth-death processes are modelled stochastically. Unfortunately, we know very well that neither lexical borrowing nor "normal" lexical change can be realistically described as a birth-death process. We even know that these birth-death processes are essentially misleading (for details, see List 2016). Instead of investing our time to enhance and discuss the stochastic models driving birth-death processes in linguistics, doesn't it seem worthwhile to have a closer look at the real processes we want to describe?
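To make the contrast between the two treatments of a binary character concrete, here is a minimal sketch. It is not taken from List (2016) or from any particular software: the four-language tree, the branch lengths, and the gain and loss rates are all invented for illustration. The same presence/absence pattern is scored once by counting changes (Fitch parsimony, assuming a binary tree) and once by computing its probability under a two-state gain/loss Markov process (Felsenstein's pruning algorithm).

```python
import math

# A node is a leaf name (str) or a tuple of (child, branch_length) pairs.
# Topology and branch lengths are hypothetical.
TREE = (
    ((("A", 0.1), ("B", 0.1)), 0.2),
    ((("C", 0.1), ("D", 0.1)), 0.2),
)

def fitch(node, states):
    """Return (possible states, minimum number of changes) below a node."""
    if isinstance(node, str):
        return {states[node]}, 0
    sets, changes = [], 0
    for child, _ in node:
        child_set, child_changes = fitch(child, states)
        sets.append(child_set)
        changes += child_changes
    common = set.intersection(*sets)
    if common:
        return common, changes
    return set.union(*sets), changes + 1

def parsimony_changes(tree, states):
    return fitch(tree, states)[1]

def transition(i, j, t, gain, loss):
    """P(state j after time t | state i) for a two-state gain/loss process."""
    total = gain + loss
    stationary = (loss / total, gain / total)  # absent (0), present (1)
    decay = math.exp(-total * t)
    return stationary[j] + ((1.0 if i == j else 0.0) - stationary[j]) * decay

def partial(node, states, gain, loss):
    """Conditional likelihoods of the data below a node, per node state."""
    if isinstance(node, str):
        return [1.0 if states[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, t in node:
        below = partial(child, states, gain, loss)
        for i in (0, 1):
            result[i] *= sum(
                transition(i, j, t, gain, loss) * below[j] for j in (0, 1)
            )
    return result

def likelihood(tree, states, gain, loss):
    """Likelihood of one character, with the root drawn from the stationary distribution."""
    total = gain + loss
    root = partial(tree, states, gain, loss)
    return (loss / total) * root[0] + (gain / total) * root[1]

if __name__ == "__main__":
    for pattern in ({"A": 1, "B": 1, "C": 0, "D": 0},
                    {"A": 1, "B": 0, "C": 1, "D": 0}):
        print(pattern,
              parsimony_changes(TREE, pattern),
              likelihood(TREE, pattern, gain=1.0, loss=1.0))
```

Both scores rest on the same gain/loss picture of character change; the sketch only differs in whether changes are counted or weighted by probability.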

  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2. 119-136.
  • Nelson-Sathi, S., O. Popa, J.-M. List, H. Geisler, W. Martin, and T. Dagan (2013) Reconstructing the lateral component of language history and genome evolution using network approaches. In: Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart. 163-180.
  • Weinreich, U. (1974) Languages in contact. With a preface by André Martinet. Mouton: The Hague and Paris.

Saturday, February 25, 2017

Blog anniversary: 5 years

The first post was put up on this blog on Saturday, February 25 2012, which makes today the fifth anniversary.

First blog header

By my reckoning, this is the 469th blog post, not all of them written by me, of course; but this makes an average of one post for every 3.9 of the 1,827 days. I have never counted the number of actual words, but if I had ever contemplated that number then I probably would never have started.

Second blog header

It is rather tricky to estimate the readership, because of the number of blog hits that clearly come from robots. However, even trying to take that into account, I get an estimate just short of 500,000 pageviews over the 5 years.

Third blog header

So, thanks to everyone for dropping by. If you ever feel inclined to re-read any of the old posts, then they are grouped roughly by topic in the "Pages" at the top of the right-hand column.

Monday, February 20, 2017

Producing admixture graphs

I have written before about admixture graphs, which are phylogenetic networks that represent reticulations due to introgression.
To date, these graphs have not really been incorporated into the mainstream network literature. Part of the problem has been the rather disparate nature of the admixture literature itself. A paper has recently appeared as a preprint in Bioinformatics that provides a brief introduction to this situation:
  • Kalle Leppälä, Svend Vendelbo Nielsen, Thomas Mailund (2017) admixturegraph: an R package for admixture graph manipulation and fitting. Bioinformatics

There are currently several quite different programs for producing admixture graphs:
  • qpgraph (Castelo and Roverato 2006)
  • TreeMix (Pickrell and Pritchard 2012)
  • AdmixTools (Patterson et al. 2012)
  • MixMapper (Lipson et al. 2013)
  • admixturegraph (see above)
These programs summarize the genetic data in different ways based on genetic drift (e.g. the covariance matrix versus so-called f statistics), and construct the graphs in different ways (e.g. sequential heuristic building versus a user-specified graph). There are also different ways to evaluate the graphs, including fitting the graph parameters using likelihood, and different ways to compare them, including the bootstrap, jackknife, and MCMC.
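For readers who have not met the f statistics, the following minimal sketch shows the basic idea, assuming we already have population allele frequencies for a set of biallelic markers. These are the plain definitions (averages of products of frequency differences) without the finite-sample corrections or block resampling that the programs above apply; the function names and the toy frequencies are hypothetical and are not the API of any of the listed packages.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def f2(p_a, p_b):
    """Drift separating populations A and B: mean of (a - b)^2."""
    return _mean((a - b) ** 2 for a, b in zip(p_a, p_b))

def f3(p_c, p_a, p_b):
    """f3(C; A, B): mean of (c - a)(c - b); negative values suggest C is admixed."""
    return _mean((c - a) * (c - b) for c, a, b in zip(p_c, p_a, p_b))

def f4(p_a, p_b, p_c, p_d):
    """f4(A, B; C, D): mean of (a - b)(c - d); zero in expectation if the
    path from A to B shares no drift with the path from C to D."""
    return _mean((a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d))

if __name__ == "__main__":
    # Toy allele frequencies at two markers for four populations.
    a, b = [0.1, 0.5], [0.3, 0.5]
    c, d = [0.2, 0.9], [0.2, 0.4]
    print(f2(a, b), f3(c, a, b), f4(a, b, c, d))
```

An admixture graph predicts a value for each such statistic from its edge lengths and admixture proportions, which is what the fitting procedures compare against the data.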

None of this is ideal. Another problem has been that the graphs are often constructed by hand, and may be needed as input to the programs. However, the biggest limitation is that there are currently no algorithms for inferring the optimal graph topology. This is, of course, the basic problem that needs to be solved for all network construction. To quote the authors with regard to their own R package:
The set of all possible graphs, even when limited to one or two admixture events, grows super-exponentially in the number of leaves, and it is generally not computationally feasible to explore this set exhaustively. Still, we give graph libraries for searching through all possible topologies with not too many leaves and admixture events.
For larger graphs we provide functions for exploring all possible graphs that can be reached from a given graph by adding one extra admixture event or by adding one additional leaf. However, the best fitting admixture graphs are not necessarily extensions of best fitting smaller graphs, so we recommend that users not only expand the best smaller graph but a selected few best of them.
The world of graph-edge rearrangements (NNI, SPR) does not yet seem to have encountered the world of admixture graphs.
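To get a feeling for the growth mentioned in the quotation, here is a small sketch that counts only rooted, leaf-labelled binary trees, using the standard formula (2n - 3)!!. Under the assumption that an admixture graph with no admixture events is simply such a tree, this count is a lower bound on the number of admixture graphs for the same leaves; it is not the authors' own count of graphs.

```python
def rooted_binary_trees(n_leaves):
    """Number of rooted, leaf-labelled binary trees: (2n - 3)!!."""
    if n_leaves < 2:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count

if __name__ == "__main__":
    for n in (4, 6, 8, 10, 15, 20):
        print(n, rooted_binary_trees(n))
```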

Tuesday, February 14, 2017

The evolution of women's clothing sizes

Several years ago I presented a piece about the Evolutionary history of Mazda motor cars, in which I pointed out that what is known in biology as Cope's Rule of phyletic size increase applies to manufactured objects as well as to biological organisms. This "rule" suggests that the size of the organisms within a species generally increases through evolutionary time. Human beings, for example, are on average larger now than they were a few thousand years ago. Furthermore, through time, new species arise to occupy the niches that have been vacated (because the previous organisms are now too big to fit).

This situation is easy to demonstrate for cars, because all successful car models get bigger through time — the customers indicate that the car is not quite big enough, and the manufacturer responds. Some examples are illustrated in Car sizes through the years.

Another simple example is women's clothing, which I will discuss here.

Women's clothing changes through time in response to two factors in the modern world: changes in the "desired" image of women (as discussed in the post on Changes in Playboy's women through 60 years), and increasing obesity in western society (see the post on Fast food and diet). Illustrating Cope's Rule in this case is thus easy.

There have been five voluntary "standards" developed over the past century for standardized clothing sizes in the USA, as discussed in Wikipedia. These standards describe, for example, which body dimensions a woman should have to fit into a Size 12. There is nothing mandatory about these standards, and they simply reflect societal recommendations at any given time. So, a Size 12 in 1958 is not the same as a Size 12 in 2008.

These three graphs illustrate the time course of the changes in each of the defined clothing sizes (Size 0 to Size 20), in terms of three female girth measurements.

This is blatantly Cope's Rule in all three cases. All of the sizes get bigger through time, at approximately the same rate. Furthermore, as the dimensions increase through time, new sizes appear to fit the smaller women: Size 8 did not exist in 1931, Size 6 did not exist in 1958, Sizes 2 and 4 did not exist in 1971, and Sizes 0 and 00 did not exist in 1995.

To put it another way, a Size 12 woman today is much larger than her Size 12 mother was, who in turn was bigger than the Size 12 grandmother. I believe that this is referred to in the clothing business as "vanity sizing", which it may well be, but it is also a natural example of Cope's Rule of phyletic size increase.

Finally, there is no reason to expect that this phyletic size increase will stop any time soon. Do cars or clothes have an upper limit on their size? Biological organisms do, mainly because of the effect of gravity, and so the phyletic size increase either ceases or the species becomes extinct. Manufactured objects are different.

Data sources
  • DuBarry / Woolworth (1931-1955) - see Wikipedia
  • National Institute of Standards and Technology (1958) Commercial Standard CS215-58: Body Measurements for the Sizing of Women's Patterns and Apparel Table 4
  • National Institute of Standards and Technology (1971) Commercial Standard PS42-70: Body Measurements for the Sizing of Women's Patterns and Apparel Table 4
  • ASTM International (1995, revised 2001) Standard D5585-95 (R2001)
  • ASTM International (2011) Active Standard D5585-11e1: Standard Tables of Body Measurements for Adult Female Misses Figure Type, Size Range 00–20

Tuesday, February 7, 2017

Networks, trees and sequence polymorphisms

One of the more obvious bits of evidence that an organismal history may not be entirely tree-like is the presence of sequence polymorphisms. For example, intra-individual site polymorphisms in ITS sequences create considerable conflict in a dataset, if we try to construct a tree-like phylogeny.

This means that people have adopted a range of strategies to try to get a nice neat tree out of their data. This topic is briefly reviewed in this recent paper:
Agnes Scheunert and Günther Heubl (2017) Against all odds: reconstructing the evolutionary history of Scrophularia (Scrophulariaceae) despite high levels of incongruence and reticulate evolution. Organisms Diversity and Evolution in press.
The authors discuss the following strategies, for which they also provide a few literature references.

1. Delete the offending taxa

Pruning the offending taxa is among the most-used tactics. This deletes part of the phylogeny, of course.

2. Delete the polymorphisms

Excluding the polymorphic alignment positions is probably the most common tactic. Similar strategies include the replacement of the polymorphisms with either a missing data code or the most common nucleotide at that position. All of these ideas resolve the polymorphisms in favor of the strongest phylogenetic signal, and thus sweep the conflicting signals under the carpet.

3. Select single gene copies

The polymorphisms become apparent because there are multiple copies of the gene(s) concerned, and therefore selecting a single copy removes the polymorphisms. This can be done by cloning the gene (at the time of data collection), or by statistical haplotype phasing methods (during the data analysis). This also sweeps the conflicting signals under the carpet.

4. Code the polymorphisms

As a preferred alternative, rather than discarding or substituting the sequence variabilities, we could include them as phylogenetically informative characters. This would allow the construction of a phylogenetic network, as well as a tree-like history.

One possibility, suggested by Fuertes Aguilar and Nieto Feliner (2003), concentrates on Additive Polymorphic Sites (APS). A sequence site is an APS when each of the nucleotides involved in the polymorphism can also be found separately at the same site in at least one other accession. Other intra-individual polymorphisms are ignored. This approach has been used to detect hybrids, for example.
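The APS definition can be written down as a short check. This is only a sketch of the definition as stated in the paragraph above, not the procedure of Fuertes Aguilar and Nieto Feliner; the function name, the toy alignment, and the decision to ignore N and gaps are assumptions made here.

```python
# IUPAC ambiguity codes for two or three nucleotides (N is deliberately left
# out and, like gaps, is treated as uninformative in this sketch).
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def additive_polymorphic_sites(alignment):
    """Map each accession to the alignment positions that qualify as APS.

    `alignment` is a dict of accession name -> aligned sequence string.
    """
    result = {}
    for name, seq in alignment.items():
        hits = []
        for pos, symbol in enumerate(seq):
            bases = IUPAC.get(symbol.upper())
            if bases is None:
                continue  # not a polymorphic site in this accession
            # Unambiguous nucleotides found at this site in other accessions.
            pure = {
                other[pos].upper()
                for other_name, other in alignment.items()
                if other_name != name and other[pos].upper() in "ACGT"
            }
            if all(base in pure for base in bases):
                hits.append(pos)
        result[name] = hits
    return result

if __name__ == "__main__":
    toy = {
        "hybrid": "ACRT",   # R = A/G; both A and G occur at this site elsewhere
        "parent1": "ACAT",
        "parent2": "ACGT",
        "other": "AYGT",    # Y = C/T; T does not occur at this site elsewhere
    }
    print(additive_polymorphic_sites(toy))
```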

An alternative, used by Scheunert and Heubl to study reticulate evolution in their paper, relies on 2ISPs (Intra-Individual Site Polymorphisms). All IUPAC codes, including those for polymorphic sites, are treated as unique characters by recoding the complete alignment as a standard matrix, which is then analyzed using a multistate analysis option for categorical data. The authors actually use the ad hoc maximum-likelihood implementation from Potts et al. (2014), with an additional adaptation of a method for Bayesian inference based on Grimm et al. (2007).
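The recoding step itself is simple to illustrate. The sketch below maps every IUPAC symbol, ambiguity codes included, to its own state in a standard (multistate) matrix. The particular state symbols and the treatment of N, gaps, and anything unrecognized as missing data ("?") are assumptions made for illustration, not the scheme used by Potts et al. (2014) or by Scheunert and Heubl.

```python
# Each IUPAC symbol becomes its own categorical state (hypothetical symbols).
STATES = "ACGTRYSWKMBDHV"
SYMBOLS = "0123456789ABCD"
RECODE = dict(zip(STATES, SYMBOLS))

def recode_as_standard(alignment):
    """Recode a dict of name -> aligned sequence into multistate rows."""
    return {
        name: "".join(RECODE.get(symbol.upper(), "?") for symbol in seq)
        for name, seq in alignment.items()
    }

if __name__ == "__main__":
    toy = {"acc1": "ACGT", "acc2": "ARG-", "acc3": "AYGN"}
    for name, row in recode_as_standard(toy).items():
        print(name, row)
```

After recoding, a polymorphic site such as R is no longer "A or G" but a separate state in its own right, which is what allows it to carry phylogenetic signal in a multistate analysis.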

You can check out these papers for details.


Fuertes Aguilar J., Nieto Feliner G. (2003) Additive polymorphisms and reticulation in an ITS phylogeny of thrifts (Armeria, Plumbaginaceae). Molecular Phylogenetics and Evolution 28: 430-447.

Grimm G.W., Denk T., Hemleben V. (2007) Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5: 291-309.

Potts A.J., Hedderson T.A., Grimm G.W. (2014) Constructing phylogenies in the presence of intra-individual site polymorphisms (2ISPs) with a focus on the nuclear ribosomal cistron. Systematic Biology 63: 1-16.