Archive for February, 2018

How many described species are sampled in published phylogenetic trees?

You may have high expectations for this post! Don’t – answering this question in a serious way would take a lot of research. Instead I’m just scrawling a few notes here for reference. Maybe someone else can do the legwork.

I was asked this question and came up with a quick estimate of 10%. Here’s what went into this number (admittedly somewhat post hoc):

The Open Tree project’s phylesystem, a repository of curated phylogenetic trees, samples about 2% of the species listed in the taxonomy (OTT), which has 2.2 million tips. That would be about 45,000 species. I’m going from memory here; I did this measurement several years ago and could be remembering it incorrectly, so you’d have to double-check that. The number might be in one of the Open Tree papers, but it’s easy to compute by traversing the JSON files in phylesystem and checking them against the taxonomy (filtering by rank if you want to be picky).
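For anyone who wants to redo the measurement, here is a rough sketch of the traversal. It rests on assumptions I haven't re-checked: that the study files are JSON, that mapped tips carry their OTT id under the key `^ot:ottId`, and that you have already built `taxonomy_ids`, a set of OTT ids (filtered to species rank if you like), by some separate means. The function names and the directory argument are made up for illustration.

```python
import json
from pathlib import Path

# Assumption: mapped tips in a study file store their OTT id under this key.
OTT_ID_KEY = "^ot:ottId"


def collect_ott_ids(node, found=None):
    """Walk a parsed JSON value and collect every integer stored under OTT_ID_KEY."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == OTT_ID_KEY and isinstance(value, int):
                found.add(value)
            else:
                collect_ott_ids(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_ott_ids(item, found)
    return found


def sampled_fraction(study_dir, taxonomy_ids):
    """Fraction of taxonomy_ids that appear in at least one JSON file under study_dir."""
    sampled = set()
    for path in Path(study_dir).rglob("*.json"):
        with open(path, encoding="utf-8") as handle:
            collect_ott_ids(json.load(handle), sampled)
    # Ids that are not in the taxonomy set (e.g. filtered out by rank) are ignored.
    return len(sampled & taxonomy_ids) / len(taxonomy_ids)
```

Searching for the key anywhere in the file, rather than following a fixed path through the structure, is a deliberate shortcut: it avoids depending on the exact layout of the study files, at the cost of counting any id that appears anywhere in a study, whether or not that tip is in a tree.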

2.2 million could be high (due to species names with unusably bad descriptions, undetected synonymies, and so on) or low (due to published species not finding their way to any of the inputs to the taxonomy). I won’t touch it for now.

A lot of the studies in phylesystem are stored but not curated very well, so we may not have detected, in a computationally usable way, all the species they sample. Let’s raise the estimate from 2% to 3% to account for that.

Phylesystem has trees for maybe 6,000 published studies (5,000 to 8,000; memory again). How much of the literature does this cover? I’m sure there are published estimates of the number of published phylogenetic studies, but the references aren’t at my fingertips. We know phylesystem is quite incomplete (Drew et al.), so 10,000 studies has to be a lower bound. 20,000 seems plausible, assuming we’re doing some filtering (e.g. counting only studies that sample at least 3 described biological species, so not linguistic phylogenies and so on). I think I’ve heard an estimate of 50,000, though I don’t remember where, and that seems plausible too.

With each additional study, there will be diminishing returns in that the species sampled may also be sampled by other studies. But let’s ignore this; it’s too much complexity for such a crude estimate.

In addition, the number of species per study is clearly not constant. Open Tree may have selectively chosen large or ‘high-yield’ studies for curation, leaving fewer species to turn up in the unselected studies. I will ignore this complication too.

So, assuming 3% coverage for every 6,000 published studies, and 20,000 published studies in total, that would be 3% × 20,000 / 6,000 = 10%.
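The arithmetic is simple enough to put in a few lines, which also makes it easy to see how much the answer swings with the guessed number of published studies. This is only a sketch of the linear scale-up used above: it deliberately ignores overlap between studies and uneven study sizes, and the function name is made up.

```python
OTT_TIPS = 2_200_000  # approximate number of tips in the taxonomy


def extrapolate_coverage(sampled_fraction, curated_studies, total_studies):
    """Linear scale-up: assumes every study contributes the same number of new species."""
    return sampled_fraction * total_studies / curated_studies


# Try the lower bound, the working guess, and the half-remembered high estimate.
for total_studies in (10_000, 20_000, 50_000):
    fraction = extrapolate_coverage(0.03, 6_000, total_studies)
    species = round(fraction * OTT_TIPS)
    print(f"{total_studies} studies: {fraction:.0%} of species, roughly {species}")
```

With these inputs the estimate ranges from about 5% (10,000 studies) to about 25% (50,000 studies), so the 10% figure is only as good as the guess of 20,000 studies.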
