Archive for the ‘Uncategorized’ Category

Pickling, uneval, unapply

I see that I am referenced on this wiki page without a link. The purpose of this post is to explain what I remember of the reason for this reference. Chris Webber prompted me to write this up during some recent conversations about object capabilities. I came up with the ‘unapply’ idea while I was doing some interpreter work with Chris Fry around 2002, and talked to Mark Miller about it a bit later.

Most programming language implementations use very different representations of data objects when they occur as text in a file as compared to when they occur as patterns of bits and pointers in a computer’s memory. The names (or identifiers) in the program serve a function similar to the pointers in memory, but the pointers are followed to their target much more quickly by the computer hardware.

The conversion from in-memory form to text is usually called “serialization” these days and the inverse “deserialization”. Some serialization/deserialization services work with human-readable text (e.g. as in Lisp), while others go to a harder-to-read “binary” form. The latter kind was called “pickling” in PARC’s Cedar, if I remember correctly [but see addendum below], and later was called “serialization” in Java, a term that confused me since I always thought serialization had to do with concurrency.

Many languages have features superficially similar to unparse/parse or serialize/deserialize but not invertible. The “print methods” of many Lisp systems, Python, and so on, and Java’s toString method, give neither assistance nor enforcement for invertibility since the code for the print method can emit whatever characters it wants. In Python, the output of repr can often be read back with eval or ast.literal_eval for built-in types, but nothing enforces this for user-defined classes.

In the case of Lisp 1.5, PRINT is an inverse of READ (at least with respect to EQUAL; structure sharing is not detected), and subsequent Lisp dialects have preserved PRINT/READ invertibility for certain data types such as symbols and lists.

The conversion from file to in-memory form is accomplished by parsers similar to Lisp READ. Of course many data objects are born in memory and are not the result of parsing, and these are ones we usually care about because we often want to write them out (pickle them) and then read back a copy (unpickle).

OK. There’s a lot to be said about pickling/unpickling and I don’t have much to contribute. I just want to mention one detail.

The central design component is the interface between the generic pickling structure traverser, and the particular data types that might need special handling. There are many reasons one might want custom pickling and unpickling, including special considerations around the “boundaries” of objects (which links to follow), redundant structure that needn’t be kept such as indexes, security considerations, and so on, so there is a lot of motivation to be able to provide object- or type-specific methods for pickling and unpickling. Because of this, some standard interface is needed.

The approach I took was to have a pickling operation that generates an expression in a data construction language (actually a subset of the programming language). Unpickling is then just evaluating the expression that’s the result of pickling. This is not arbitrary evaluation like the ‘#.’ feature of Common Lisp; the security of unpickling is assured using object capability confinement techniques (see E, W7, etc.).

Now it might seem tempting in this framework to say that the pickler/type interface is an “uneval” method that given an object returns an expression. That is, the type-specific customization method does the pickling directly. But most objects have other objects as parts or links. A custom uneval method would need to have the ability to uneval its parts, and in the interval after receiving the unevaled expression for the part and placing that expression into the larger expression for the whole, the uneval method would have the opportunity to inspect the pickled version of the part, which feels wrong in many different ways (it is not compositional, not secure, not modular, etc.). In addition, if the custom method composes expressions, it has to know something of the syntax of the pickling language, and has many opportunities to make mistakes.

An alternative presents itself that isolates the custom method from any knowledge of syntax or of the representations of its parts. This is for the interface to the custom method to be unapply rather than uneval. Unapply is an inverse of apply: it takes an object and returns a procedure and a set of arguments. The procedure and arguments have the property that the procedure applied to the arguments is the object (or a copy or clone of the object).

The pickling harness can use unapply to reduce the problem it’s presented with to that of pickling a set of simpler objects. It can use whatever syntax it likes (coordinated with the unpickler) to express the application of the procedure to the arguments.

An added benefit is that the pickling harness can detect shared structure: it can look at the sub-objects delivered by unapply and determine which ones are shared among multiple parts of the overall object-part-graph being pickled, and generate expressions using constructs similar to let and letrec that re-create the shared structure on unpickling.

The base cases for pickling would be scalars (numbers and so on) together with objects that are required or known to have incarnations in the system doing the unpickling, such as the constructors for the various types. For proper capability discipline, any call to the unpickler has to receive an environment to be used to map the names or tokens designating scalars, such as constructors, to the scalars themselves, which if they’re functions might get invoked to help build up the in-memory representation.
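Here is a minimal sketch of how such a harness might look, written in Python. It is not the interpreter code from 2002. Every name in it (Point, Segment, pickle_obj, unpickle_obj, the tuple-based expression format) is made up for illustration, and it assumes the object graph is acyclic, so a let-like list of bindings is enough and letrec is not needed.

```python
from collections import Counter

SCALARS = (int, float, str, type(None))  # base cases that need no unapply


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def unapply(self):
        # Hand back a constructor token and the parts: no syntax, no pickled parts.
        return "Point", (self.x, self.y)


class Segment:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def unapply(self):
        return "Segment", (self.a, self.b)


def pickle_obj(root):
    """Return (bindings, expr); parts that are shared become named bindings."""
    counts = Counter()

    def count(obj):
        if isinstance(obj, SCALARS):
            return
        counts[id(obj)] += 1
        if counts[id(obj)] == 1:  # walk the parts only on the first visit
            for part in obj.unapply()[1]:
                count(part)

    count(root)
    bindings, names = [], {}

    def expr(obj):
        if isinstance(obj, SCALARS):
            return ("lit", obj)
        if id(obj) in names:
            return ("var", names[id(obj)])
        ctor, parts = obj.unapply()
        call = ("call", ctor, [expr(p) for p in parts])
        if counts[id(obj)] > 1:  # shared: bind once, refer to it by name afterwards
            name = f"v{len(bindings)}"
            names[id(obj)] = name
            bindings.append((name, call))
            return ("var", name)
        return call

    return bindings, expr(root)


def unpickle_obj(pickled, env):
    """Evaluate the expression; env is the only source of constructors."""
    bindings, root = pickled
    variables = {}

    def ev(e):
        if e[0] == "lit":
            return e[1]
        if e[0] == "var":
            return variables[e[1]]
        _, ctor, args = e
        return env[ctor](*[ev(a) for a in args])

    for name, e in bindings:
        variables[name] = ev(e)
    return ev(root)


p = Point(1, 2)
s = Segment(p, p)  # p is shared by both ends of the segment
pickled = pickle_obj(s)
copy = unpickle_obj(pickled, {"Point": Point, "Segment": Segment})
```

The unapply methods never see an expression, only their own parts. The harness alone decides the syntax, and the caller of unpickle_obj decides, through env, which constructors the expression is allowed to reach. In this sketch the shared Point is bound once, so the two ends of the copied segment are the same object again.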

I don’t have a lot more to say about this – I think this is the story I told Mark Miller way back when, that led him to acknowledge me on the “safe serialization” wiki page.

Addendum 2020-12-11: I have no evidence that Cedar used the word “pickle” or had such a facility, so don’t quote me on that. I just checked one of the Cedar tech reports on line and didn’t find it. Modula-2+ had pickles by 1986, and Modula-2+ was created by the same people who created Cedar, so attributing them to Cedar is at least plausible. Certainly the idea was kicking around at Yale (Nat Mishkin, Kai Li) and Xerox (Bruce Nelson’s RPC work) in the early 1980s and probably earlier elsewhere (maybe MacLisp FASDUMP / FASLOAD ??) – it’s not as if it would be difficult to reinvent. (credit due to Paul McJones, John Ellis, and Nat Mishkin for some of this information, thanks)

Categories: Uncategorized

Thoughts on TaxonRelationshipAssertion

Issue: More appropriate name for TaxonRelationshipAssertion

https://github.com/tdwg/tnc/issues/48

(Message composed for the taxon names and concepts interest group; too long to put into the github comment stream, so depositing it here so I can just drop a link.)

TL;DR the vocabulary as it stands is mostly OK, we just need to clarify the documentation.

I didn’t read every word so I apologize if I repeat someone or have missed some important point.

I looked at the draft specification (? the tnu_terms table) and I think it is workable; it just needs to have the underlying theory tightened up a little so that readers are steered away from confusion.

I think the idea of a TaxonomicNameUsage, and that of a TaxonomicName, are excellent.

I think the documentation for parentNameUsage, vernacularName, and preferredName all need to be clarified to emphasize that this information is according to what the source says (NOT according to the author, who may have changed her/his mind since writing the source!). We need to be very clear that the purpose of this class is to anchor what we say to documentary evidence, and to draw a line between what the source says and how we interpret it. If we want to interpret, we will do so in sources we write, and that will lead to our own TNUs.

The definition of TNU (“operationalization of a taxonomic concept”) is vague and leans too heavily on “taxonomic concept”. It doesn’t help us to know what these things are like, what their properties are, what might be true of them (these are the basic questions of ontology). Morphologically a TNU would have to be a kind of usage. Webster’s says a usage is a generally accepted practice or procedure – an action that somebody takes with something. So one might say “that usage is familiar / unfamiliar / obsolete” (potential properties). Or it “is common in the UK” or “is no longer in vogue”. And we can assume that what is being used is some ‘verbatim name string’. TNU is slightly more granular than this in that TNU 1 and TNU 2, both of the same verbatim name string, might have indistinguishable usage, yet still be different TNUs because the sources are different. This to me does not seem very consequential because we can always imagine that the two references lead to some difference in usage, even if it’s undetectable in the population of ‘users’.

This takes us to TaxonRelationshipAssertion. The word ‘taxon’ has a fraught history and I understand why biodiversity informaticians treat it as radioactive. I read Lam’s paper “What is a taxon” – he claims responsibility for reviving the word in the 1950s, after it languished for twenty-five years or so – and he gives a series of definitions that are wildly in contradiction. He finally settles on saying it has to mean whatever the ICZN code says it means, and that it should not be used in botany.

I am not opposed to using the word ‘taxon’. It is a delightful word if you ignore its baggage. To me the best definition is that taxa are the subject matter of taxonomy, where taxonomy is taken in the sense of biological classification, with nomenclature being an independent pursuit. That is, I want ‘taxon’ to be a biological entity, not a human, administrative entity. We have ‘name’ and ‘TNU’ as good administrative entities, but when you do science you have to interpret names or TNUs as biological entities.

In particular, we don’t want the situation where a group for millions of years is not a taxon, and then suddenly, when an article describing it is published, it becomes a taxon. Or, to have a manuscript submitted for review be rejected for calling a group a taxon, just because there has not yet been a publication that describes the group.

So, nothing that a human can do (other than ecological manipulation such as extirpation) should be able to affect anything we say about a taxon (in this sense).

Because I will probably be attacked or misunderstood for trying to define ‘taxon’ this way (as a biological entity), I will, for now, just use the word ‘group’. (However… I see that the ‘TNU terms’ table seems to use ‘taxon’ in just this way – see definition of TaxonRelationshipAssertion. If people here like ‘taxon’, I celebrate anyhow.)

By ‘group’ I don’t mean the mathematical notion of ‘set’, or Lam’s other candidate meaning of natural grouping based on characters; I mean a group that a competent taxonomist might circumscribe. I don’t know if we can, or need to be, more precise than that.

Given a word for the biological groupings we care about, this lets me talk about ‘TaxonRelationshipAssertion’.

What is going on is not that TNUs are groups, but that they are interpreted as designating groups. We want to be able to claim that if t1 and t2 are TNUs, then the groups that we interpret t1 and t2 to be are equivalent, or satisfy some other RCC-5 relation, etc. etc. It is not the TNUs that are equivalent or whatever, it is the groups.

We don’t need to put groups into the vocabulary to accomplish this. We just have to change the documentation of some terms to make it clear what’s going on.

And by the way the word ‘concept’ has no place here at all. Concepts are theoretical entities that live inside human minds – or perhaps in linguistic communities, who knows – nobody really knows what they are. Luckily we don’t need to know. Our subject matter is not psychology or linguistics, and does not include concepts. http://ontology.buffalo.edu/bfo/BeyondConcepts.pdf

OK, we were talking about how to express a claim that two groups (which might actually be only one group) are related in some way. I agree with Markus that ‘assertion’ is redundant: we assert what we believe (conjecture, calculate, repeat, etc.) just by saying it. The important thing here is not the asserting, which is ubiquitous in the document at hand, but the relationship between the groups. If we document that the groups in question (those whose relationship we’re claiming) are the ones that the subjectTNU and objectTNU are *about*, that the TNUs’ purpose is to tell us which groups we’re talking about, then everything will be clearer.

(I have taken to using ‘relationship’ for something that holds between two particular individuals, and ‘relation’ for a general pattern connecting individuals of various classes. This is a technical distinction that is probably not in widespread use.)

So ‘group’ would only occur in definition strings, and possibly in GroupRelationship (or TaxonRelationship, or whatever term you/we choose as long as it has nothing to do with names or concepts).

Now I started this all off by saying that TaxonRelationshipAssertion was not a good name because it suggested a relationship between taxa, rather than between TNUs. I contradict myself and I am sorry. The relationship is ultimately between taxa/groups, but it is expressed with the assistance of the TNUs, and therefore induces an incidental relationship between the TNUs. Suppose t and u are TNUs, and I is the interpretation function that takes TNUs to groups (that is: I(t) is the group that t is about). Then there really are two relationships, say R1 and R2, where R1 holds between the TNUs R1(t,u), and R2 holds between the groups R2(I(t), I(u)). I was thinking of R1 when I complained, but the purpose of the ‘assertion’ is to make an R2-type claim, so that is why I’m now thinking it’s OK for the connective to talk about the groups/taxa instead of the TNUs.
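To make the R1/R2 distinction concrete, here is a toy sketch in Python. The big simplification, and it is one the text above explicitly warns against taking literally, is that each group is stood in for by a finite set of specimen labels. The table I, the TNU ids, and the relation names are all invented for illustration.

```python
def rcc5(g, h):
    """RCC-5 relation between two non-empty sets standing in for groups."""
    if g == h:
        return "equal"
    if g < h:
        return "proper part of"
    if g > h:
        return "properly includes"
    if g & h:
        return "overlaps"
    return "disjoint from"


# Hypothetical interpretation function I, given as a table: TNU id -> group.
I = {
    "tnu:t": frozenset({"specimen1", "specimen2", "specimen3"}),
    "tnu:u": frozenset({"specimen1", "specimen2"}),
}


def R2(g, h):
    """The relationship we actually care about: it holds between groups."""
    return rcc5(g, h)


def R1(t, u):
    """The incidental relationship between TNUs, induced by R2 through I."""
    return R2(I[t], I[u])
```

R1 never inspects the TNUs themselves; all it does is look up what they are about and defer to R2. That is the sense in which the claim is ultimately about the groups.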

This doesn’t preclude the addition of Group or Taxon as a class in the vocabulary. I don’t have an opinion yet on whether that would be desirable. I know it’s been discussed in this comment thread but I have not followed the argument in detail.

Addendum about ‘circumscription’: A circumscription is an act of circumscribing, and circumscribing is something that people do to help other people know what group (or similar bounded entity) they are talking about. So circumscriptions are not groups, and do not belong to the realm of biology. If we look at a piece of writing (or a video of a lecture, etc.) we may find words that we would call a ‘circumscription’, but that is not quite accurate – it is just a very common metonymy where we confuse some words with what they say. A similar case is ‘contract’ where most people freely mix up the contract itself, which is an agreement or ‘meeting of minds’, with its written record. These are different things with different properties. Closely related, yes, but not the same.

In writing all this I’m trying to set an example of a certain way of talking about and choosing words. In technical writing, in my opinion, it is best to avoid metonymy and to agree with ordinary language, unless there’s a very good reason not to (and from time to time there is). Avoiding metonymy means using different words for different things. Consistency with broader usage in society ought to be easy to judge, if not to do. It often helps to just look at a dictionary, but in difficult cases a corpus analysis or literature review might be called for.

What good are taxonomic ranks?

It is sometimes said that taxonomic ranks – genus, family, class and so on – are useless and should be discontinued.

(See for example: Animal Diversity Web, Phylocode)

I agree that in isolation, saying that a taxon “is a class” or “is an order” tells you nothing. One often sees statements like “90% of all families went extinct” as if “family” actually meant something. It doesn’t, because whether a group is designated a “family” or “class” is an almost arbitrary choice of the individual preparing a classification. The choice of rank may persist because taxonomists respect precedent, but when it’s assigned in the first place or under revision, it’s a free for all.

Sometimes people try to associate rank with historical divergence time, so that all “classes” diverge from one another before any “orders” diverge from each other and so on, but my opinion is that (a) this is not current practice and (b) such a project is both doomed and not very helpful. Above the level of species, it is hard to do it wrong.

However those who would dispose of ranks entirely and say they’re useless are missing a function of ranks: they give hints as to relationships between taxa. I’m not saying it’s an important function – maybe it is, or not – but it is a function.

It is axiomatic that if two taxa are given the same rank in a classification, then they are disjoint. They have no members in common. Nothing can belong to two families or two orders. No reasonable classification has a family contained in a family. Before such a classification is created, the rank of one or the other is changed so that a taxon’s direct child has a different rank from the taxon itself.

Taxa with different ranks may or may not be disjoint, so the inference only goes in one direction. But when the ranks are the same, disjointness can be inferred just by looking at ranks, with no need to examine the hierarchy.

A second, less important but not useless, function of ranks is that ranks, being ordered, can rule out some possible inclusions. We know that an order can never be subsumed by a genus (within one classification). So we can tell that a given order is not in a given genus just by looking at the ranks, with no need to examine the hierarchy.
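The two shortcuts can be sketched in a few lines of Python. This is only an illustration: the rank list is abbreviated, the Taxon class and function names are invented, and both functions assume the two taxa come from the same classification.

```python
from dataclasses import dataclass

RANK_ORDER = ["species", "genus", "family", "order", "class"]  # low to high
LEVEL = {rank: i for i, rank in enumerate(RANK_ORDER)}


@dataclass(frozen=True)
class Taxon:
    name: str
    rank: str


def surely_disjoint(a, b):
    """True when the ranks alone show that two distinct taxa are disjoint."""
    return a != b and a.rank == b.rank


def surely_not_within(a, b):
    """True when the ranks alone show that a is not subsumed by b."""
    return LEVEL[a.rank] > LEVEL[b.rank]


felidae = Taxon("Felidae", "family")
canidae = Taxon("Canidae", "family")
carnivora = Taxon("Carnivora", "order")
felis = Taxon("Felis", "genus")
```

A False result from either function means only that the ranks are silent; in that case the hierarchy still has to be consulted.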

So while designated ranks may be of limited utility, they are not useless.

Disjointness and non-subsumption are important pieces of information in any kind of reasoning about collections of taxa. Of course this information can be gleaned by looking at the hierarchy. But having designated ranks gives a shortcut that saves time. We get faster code, and a lighter cognitive load for humans.

From this point of view, ranks are just arbitrary tokens that can be applied within a classification to sets of disjoint taxa at any level. “Family”, “class”, and other tokens no longer mean anything at all except in relation to the tokens given to other taxa. Two taxa that share a token are disjoint. Ranks may not be necessary to the identity (specification) of a taxon, but they can provide a shorthand for relationships to other taxa that bears on the identity of a taxon. If you are wondering what taxon is meant by ‘A’, it might help you to know that A is disjoint from B.

I’d like to say why I was thinking about this – it has to do with RCC-5 and the representation of hierarchies, especially those that include _incertae sedis_ taxa – but this post is already long enough.

ADDED 5/24/2019: I never said I was a scholar about this stuff; this is just a blog. Nico Franz has kindly pointed me to his article (with David Thau) “Biological taxonomy and ontology development: scope and limitations” from 2011 which on page 57 basically says the same thing.

About phyloreferences

[This post has been reblogged here. If you’re inspired to comment, please do so there.]

Gaurav Vaidya has written two interesting articles on phyloreferences:

The following is my attempt to make some sense of phyloreferences, and re-express the ideas (to the extent I understand them) in my personal idiolect. I make no claims of novelty.

Classification and inference

In any area of study where one deals with lots of things, it is important to discover natural groupings (classes, taxa) of those things. Define a grouping G to be natural if membership of an item X in G helps to predict properties of X, beyond just those properties that led it to being placed in G. That is, if other members of the group have some property P, then X is more likely to have property P than it would be otherwise. (I am being intentionally imprecise; read on.)

Biology certainly has to deal with lots of things, and lots of kinds of things – molecules, alleles, specimens, species, and so on – so for the purpose of prediction, it puts a lot of energy into finding natural classifications.

In the case of evolved entities, groupings that are consistent with evolution are often called out as ones that are likely to be natural. Such a grouping has the property that all of its members descend from some hypothetical founder; such groupings are called monophyletic groups, or clades. The search for natural groupings, and the search for evolutionary history, are not logically related a priori, but the assumption that clades are natural is a sensible heuristic, because properties are for the most part inherited.

(For the sake of focus I won’t talk about the relation of non-hierarchical or recombinant effects, i.e. sex, lateral gene transfer, hybridization, and so on, to classification, although they are undeniably important.)

Phylogenetic trees

Membership in a clade can be difficult to determine, both because we might not know much about the clade’s founder, and because ancestry can be very difficult to work out. Formal methods for obtaining hypotheses of ancestry and relatedness are collectively known as phylogenetic analysis, and its results have been impressive. On the other hand, hypotheses proposed by phylogenetic analysis are sometimes very weak, in which case nobody puts much confidence in them.

Phylogenetic analysis starts with a fixed set S of items, understood to be mutually distinct or disjoint. The items are only a small set of samples among some much larger universe U of items under study. (E.g. 25 individual mammal specimens from museums might be used to infer aspects of the evolutionary history of all mammals.) The output of the analysis is a tree T (‘tree’ in the computer science sense) with tip nodes corresponding to the items, together with internal nodes and a root node. The nodes are connected by arcs, which we can take as directed away from the root of T.

It is conventional to interpret the nodes of T as clades, but additional assumptions are needed for this to make sense, because appropriate clades may not exist, or there may be many appropriate clades to choose from.

For each node N in T, we can consider the clades C, among all the clades in U, that are ‘compatible’ with N in the following technical sense: C is compatible with N if C contains all of the items in S whose nodes are reachable from N, and excludes all of the items in S whose nodes are not reachable from N. Any node N expresses a hypothesis that there exists (or once existed) a clade compatible with N.
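The compatibility test is easy to state in code. The Python sketch below assumes a tree stored as a dictionary from each internal node to its list of children, and a candidate clade given as a plain set of items from U. Both representations, the function names, and the toy tree are my own choices for illustration.

```python
def reachable_items(tree, node):
    """Items of S at the tips reachable from node (a tip reaches itself)."""
    children = tree.get(node, [])
    if not children:
        return {node}
    items = set()
    for child in children:
        items |= reachable_items(tree, child)
    return items


def is_compatible(clade, tree, node, S):
    """C is compatible with N: C contains every item of S reachable from N
    and excludes every item of S that is not reachable from N."""
    inside = reachable_items(tree, node)
    outside = S - inside
    return inside <= clade and not (clade & outside)


# Toy tree T over the item set S.
T = {
    "root": ["n1", "garter snake"],
    "n1": ["platypus", "n2"],
    "n2": ["koala", "rat"],
}
S = {"platypus", "koala", "rat", "garter snake"}
```

For node n2, both {koala, rat} and {koala, rat, opossum} pass the test. They agree on the items in S and differ only on an item of U that the analysis never sampled.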

For given N, there may be no compatible clade in U (i.e. N’s hypothesis about evolution may be incorrect). If there is a compatible clade C, there will be many of them, identical in terms of which items from S they contain (and don’t), but differing regarding containment of other items in U.

For given N it is often useful to select a single clade C(N) for use in further analysis. We might be able to get away with saying: “Suppose there are compatible clades; let C(N) be any such clade,” if the choice doesn’t matter. Or, we might say: “We will decide which clade we care about later, after more information is in” treating C(N) as a variable to be solved for. Other conventional rules for selection are to pick the crown clade (the smallest clade in U compatible with N), the stem clade (the largest clade in U compatible with N), or the unique compatible clade that originates some particular evolutionary innovation (apomorphy).

Item specifier matching

The items in the item set S are not given directly, but rather are specified with bits of writing (identifiers, descriptions, etc.) that we have to interpret, so any use of a phylogenetic analysis in conjunction with other data has to start with scrutiny of those item specifiers. Consider in particular the case of comparing the evolutionary hypotheses expressed by a tree T1 with those expressed by a tree T2, where either their item sets S1 and S2 are different, or their items are specified in different ways, or both. To get a meaningful comparison, the item specifiers in T1 have to be matched with item specifiers in T2, consistent with their respective intended meanings.

I don’t have much to say about how the matching is done. Gaurav suggests using automated ontology-based inference such as OWL DL, and that sounds like a fine idea to me. Given item specifiers I1 from T1 and I2 from T2, the outcome of a match attempt could be that they specify the same item, or different ones; or, if the items are themselves groupings (such as species, as opposed to specimens or DNA samples), we might have a subsumption or non-subsumption overlap relation between the groupings.

When an item specifier match exists and is unique, we are ready to move on. But when we get 1-to-n or n-to-n’ matches, interpretation is harder. Suppose the matching phase says that the items specified by I2a, I2b, and I2c from T2 are subsumed by the item specified by I1 in T1. If there is a node N2 from which I2a, I2b, and I2c are reachable, and no other matched item specifiers are, we can hypothesize the existence of a clade C(N2) that is the same as one of the clades compatible with the I1 node. In harder cases, such as where there is no such node N2, or if matching items overlap without either subsuming the other, one will have to think harder about what to do.

With a completed matching, we are in a position to ask whether, for nodes N1 (in T1) and N2 (in T2), it would be consistent to assume the existence of a clade C compatible with both N1 and N2. If so, it would be natural to say that nodes N1 and N2 are ‘compatible’. With this notion of node compatibility in hand, we can then ask, which nodes N2 in T2 (if any) are compatible with N1?

Phyloreferences

The following definition of ‘phyloreference’ is my own, and perhaps incompatible with the way others use the term.

A phyloreference is an information-thing, perhaps a bit of writing, intended to refer to a clade. A phyloreference consists of:

  1. a nonempty set I of item specifiers (‘in-specifiers’),
  2. a set O of item specifiers (‘out-specifiers’), necessarily nonempty if I is a singleton set,
  3. a clade choice rule: either ‘crown’, ‘stem’, or ‘apomorphy A’ for some A, allowing one to choose a single clade among all the clades containing the items specified by I and not containing the items specified by O.

To connect to the preceding exposition, phyloreferences are effectively nodes in degenerate phylogenetic trees. Given a phyloreference P, let T(P) be the tree defined as follows:

  1. Let T(P)’s root have a child N,
  2. let T(P)’s root also have a child for each item specifier in O,
  3. let N have a child for each item specifier in I.

Now, from the clades compatible with N, let C(P) = the one determined by P’s choice rule (if one exists). C(P) is then the clade that we intend, when we take P to refer to a clade.
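The construction of T(P) above can be written down directly. In this Python sketch the class name, field names, and dictionary-of-children tree format are assumptions of mine, and item specifiers are plain strings that are assumed not to collide with the node labels "root" and "N".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phyloreference:
    in_specifiers: frozenset
    out_specifiers: frozenset
    rule: str = "crown"  # or "stem", or some apomorphy label

    def __post_init__(self):
        if not self.in_specifiers:
            raise ValueError("I must be nonempty")
        if len(self.in_specifiers) == 1 and not self.out_specifiers:
            raise ValueError("O must be nonempty when I is a singleton")


def degenerate_tree(p):
    """T(P): the root has child N plus one child per out-specifier;
    N has one child per in-specifier."""
    return {
        "root": ["N"] + sorted(p.out_specifiers),
        "N": sorted(p.in_specifiers),
    }


P = Phyloreference(
    in_specifiers=frozenset({"platypus", "koala", "rat"}),
    out_specifiers=frozenset({"garter snake"}),
)
tree = degenerate_tree(P)
```

The sorting is only there to make the output reproducible; the order of children carries no meaning.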

Any use of a phyloreference as a reference builds in the implicit hypothesis that such a clade exists, much as 5/y > 2 builds in the hypothesis that y is nonzero.

A phyloreference P may be used to locate a node in a given tree, say T. Assume that all of P’s item specifiers are matched to item specifiers associated with tips of T. Observe that for any clade C, C is compatible with at most one node in that tree. (This is not true in the case of an apomorphy choice rule, where the location of the apomorphy in the tree is not known; dealing with this situation is left as an exercise for the reader.) So we can interpret P to find a unique node of T, when there is one that’s compatible.
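Here is one way the lookup might be coded, as a hedged sketch in Python. It reads ‘crown’ as the smallest node of T that subtends all of I and none of O, and ‘stem’ as the largest such node. It assumes that every specifier has already been matched to a tip label and that T has no single-child nodes. The apomorphy rule is not handled. The function names and the tree format are invented for illustration.

```python
def tips_below(tree, node):
    """Set of tip labels reachable from node."""
    children = tree.get(node, [])
    if not children:
        return {node}
    tips = set()
    for child in children:
        tips |= tips_below(tree, child)
    return tips


def resolve(tree, root, ins, outs, rule="crown"):
    """Return the node located by (ins, outs, rule), or None if there is none."""
    if rule not in ("crown", "stem"):
        raise ValueError("only the crown and stem rules are sketched here")
    candidates = []

    def walk(node):
        tips = tips_below(tree, node)
        if ins <= tips and not (outs & tips):
            candidates.append((len(tips), node))
        for child in tree.get(node, []):
            walk(child)

    walk(root)
    if not candidates:
        return None
    pick = min if rule == "crown" else max
    return pick(candidates)[1]


T = {
    "root": ["n1", "garter snake"],
    "n1": ["platypus", "n2"],
    "n2": ["koala", "rat"],
}
```

With I = {koala, rat} and O = {garter snake}, the crown rule lands on n2 and the stem rule on n1. With I = {platypus, rat} and O = {koala}, no node qualifies and the result is None, which is the case where P's implicit hypothesis disagrees with the tree.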

Because trees can express incompatible phylogenetic hypotheses, there is no guarantee that a phyloreference locates nodes compatible with the same clade in every tree. There might be an item specified in T1 and T2 but not in P whose specifier is reachable from N1 but not from N2. When we go to find clades for N1 and N2, we will have to choose different clades for them.

An application of phyloreferences?

Unfortunately I still don’t know what phyloreferences are for – for what problem they provide the best available solution. So I will just talk about my own experience with them.

I was interested at one point in using them in the Open Tree of Life project. The problem to be solved was what I like to call “transfer of annotation”. Somebody (or a piece of software) wants to point to a node N1 in tree T1 and say something about it or some related entity, such as a clade. Suppose they want to say A1(N1). A1(N1) might be a comment, a bug report, a citation, a link to a data source, etc. The problem then is what to do with all the annotations when a new tree T2 is published as an improvement on T1. One would like to stick one’s neck out and say: perhaps A1(N2), because A1(N1) and N2 is an awful lot like N1.

Well this looks like a lost cause. ‘An awful lot like N’ doesn’t provide enough predictability for users. If you are talking about nodes in trees, then the truth of what you are saying could depend on just about any detail of the entire tree, or of anyone’s interpretation of the tree. That is, node annotations are, strictly speaking, not transferable at all, lacking further understanding of what the annotation is trying to say. And ‘further understanding’ is usually not something a big data aggregator has.

This attitude is too pessimistic. A better approach is to establish a simple, consistent rule for annotation transfer, and to be transparent about it. We might give each annotation point a rule (call it a ‘locator’) that (1) uniquely selects that node from the tree and (2) will select the node in other trees to which the annotations will be attached.

This means that the annotations are primarily connected with the locator, not with the node. If everyone understands that, and can see what the locators are and how they work, then at least the whole process is transparent and unambiguous.

The more inclusive the locator (fewer constraints), the more annotations will be transferred, possibly creating ambiguities and/or false positives. The more restrictive the locator (more constraints), the more likely an annotation would have no good place to go, and would become lost or difficult to find. There is no way to be perfect.

I wanted locators to be small and easy to use and understand. Each locator would have a unique identifier for easy reference. We called the identifiers ‘node ids’ or ‘OTT ids’ but under this plan they would be locator ids. Annotations would be connected to nodes via locators specified by their locator ids.

Phyloreferences seemed a good tool to use here (although I didn’t know the word at the time). They are easy to understand, easy to compute with, and moderately robust in the sense that using a given phyloreference on two trees is ‘likely’ to yield the same, or compatible, interpretations (clades) relative to the two trees.

So, for example, we might have a node N in tree T1 subtending all and only mammals (mammal-item nodes). We create a phyloreference / locator P for N with, say, I={platypus, koala, rat}, O={garter snake}, A=‘whatever it takes to be called a mammal’, and store it. (Yes the name business is cheating but taxon names are often the closest we have to an apomorphy in this kind of bulk informatics.) When we want to use P in the context of ‘improved’ tree T2, we match P’s item specifiers to item specifiers in T2, and if we’re lucky these all match uniquely. Then we can resolve P, i.e. look at nodes in taxonomy T2 that subtend T2’s I-nodes and exclude T2’s O-nodes. This will usually yield a unique node N2 in T2.

If there is no N2, we have a conflict between evolutionary hypotheses, and there is not much to say. If T2 has multiple ‘mammal’ pseudo-apomorphy points, this is a pathological case and should probably be flagged for manual intervention.
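The resolution step described above can be sketched in a few lines. This is only a toy illustration, not the prototype's code: the tree encoding (a dict from each internal node to its children), the function names `tips_below` and `resolve`, and the node labels are all assumptions made for the example. It returns every node that subtends all of I and none of O, so an empty result corresponds to a conflict and more than one result to an ambiguity that would need a tie-breaker such as the name.

```python
def tips_below(children, node):
    """Return the set of tip labels subtended by `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}  # a node with no children is a tip
    tips = set()
    for kid in kids:
        tips |= tips_below(children, kid)
    return tips


def resolve(children, include, exclude):
    """Return the internal nodes that subtend every I item and no O item."""
    matches = []
    for node in children:
        tips = tips_below(children, node)
        if include <= tips and not (exclude & tips):
            matches.append(node)
    return matches


# Hypothetical 'improved' tree T2: internal node -> list of children.
T2 = {
    "amniotes": ["mammals", "garter snake"],
    "mammals": ["platypus", "therians"],
    "therians": ["koala", "rat"],
}

# A tree with a conflicting hypothesis: rat grouped with garter snake.
T2_conflict = {
    "root": ["platypus", "x"],
    "x": ["koala", "y"],
    "y": ["rat", "garter snake"],
}

I = {"platypus", "koala", "rat"}
O = {"garter snake"}

print(resolve(T2, I, O))           # one candidate node
print(resolve(T2_conflict, I, O))  # no candidate: conflicting hypotheses
```

Recomputing the tip set for every node is wasteful on a tree with 100,000 internal nodes; a real implementation would compute the tip sets bottom-up once, or start from the most recent common ancestor of the I items.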

[Added 6/15: More on the names business: Automatic bulk phyloreference-to-tree resolution is already heuristic, and names seem a plausible practical cue to use when there is an ambiguity based on I and O. But many nodes don’t have names, and you can easily go wrong using names to match. So it might be better just to stick with the crown or stem rule uniformly. There is a lot of room for improvement in this theory.]

What happened

This design was rejected by the project, and not pursued further. I’m not complaining; I think the theory was not well enough developed at the time to warrant the investment.

One objection was the arbitrariness of the choices of I and O. These sets had to be chosen automatically as we had no way to manually review phyloreferences for over 100,000 internal nodes. In my prototype I used heuristics to guess at which items (usually species) were most likely to persist into the future. It was hard to figure out how big I and O should be (2 items? 5? 30? arbitrarily large?).

Another objection was to their summary nature. We have the entire trees T1 and T2, the thinking went, so why not make use of the entire trees somehow, rather than use a small summary? After all, the use of summaries can lead to false positive matches. Computational feasibility did not seem to be a very principled reason.

Another objection was identifier (and therefore annotation) instability as the members of I and O ‘moved’ across the ‘apomorphy’. It is useful to be able to transfer annotations for groups whose membership changes, but if the summary I and O contain items that move, then the old locator no longer locates a node in the new tree. For example, if a cockroach is in the O-set for termites, then a new tree putting cockroaches inside of termites would not have a node for the termite locator; this would be unfortunate since it is our understanding of cockroach evolution, not the apomorphy for termites, that has changed. Termites are ‘unchanged’, even if we were ‘wrong’ before about whether cockroaches are termites.

Open Tree’s multiple taxonomy version sequences would provide an empirical basis for studies that test to see how frequently phyloreferences of this kind “break” (become unresolvable or ambiguous in some trees). If the number of failures is small enough that failures can be processed manually, then perhaps the approach is feasible after all.

Beyond this little study, I do not want to say that there are no uses for phyloreferences; I am confident that the people working on them do so for very good reasons, and that I am just a slouch for not having discovered them. I look forward to hearing how this line of work continues.

Thanks to Gaurav Vaidya for all his help. All errors are mine.

Appendix: A note on terminology

One thing that often irritates me in others’ writing is confusion about whether someone is talking about information-things or biology-things. For example, the word “clade” is used in both ways: sometimes a clade is a node in a tree (information), and sometimes it is a monophyletic group of organisms (biological things). The term “metonymy” describes these switcheroos, which consist of the replacement of a thing (e.g. a clade) with some related thing (e.g. something that can be interpreted as a clade). In ordinary language, metonymy is very common and people usually make sense of it without noticing it. But when you combine difficulties with knowing what is true (e.g. phylogenetics) with a need for formal rigor (ontology and programming), metonymy is a train wreck. It is too easy to make statements that are confusing to the reader or inference system and that, when implemented computationally, can lead to wrong answers or inconsistency.

(“Clade” has a third use – perhaps the most common – as “clade hypothesis” or “hypothetical clade”, something that might be expressed by an information-thing such as a node. If we are both looking at a phylogenetic tree that groups turtle with weasel putting opossum outside that group, what the tree says makes no difference to whether there is actually a clade containing turtle and weasel but excluding opossum. There isn’t any such clade! We all know that! But there does exist a (false) clade *hypothesis* that claims that there is such a clade. There are things that we might say about the hypothesis, and so on. There is nothing inconsistent about this, unless one uses “clade” metonymously.)

In light of this I’ve tried to use words carefully. An “item” is a biological thing, a real-world entity that you can see or touch or measure or reason about, and that is situated in space and time. So, a pinned insect in a museum, and the DNA in a well in a plate in a wet lab, are items. An “item specifier” is an information-thing, a bit of writing that people can copy from place to place. (To be extra careful one would distinguish each particular copy of the item specifier, situated in space and time, from the common pattern to which all copies conform.)

Part of the reason that “clade” is so difficult to keep straight is that while clades are biological things, and easily characterized, they are very difficult to know anything about. This difficulty in knowing makes the biological sense of the word almost useless to biologists, who with good reason only want to talk about clade hypotheses, not some unknowable reality.

These distinctions become critical as one moves from natural language to logical frameworks such as the Web Ontology Language (OWL), and infinitely more so if there is the possibility of integration with other frameworks. A system that manages on technical grounds to be logically consistent, yet confuses information-things with biological-things, may cause some second system to become inconsistent when the two are combined.

I can’t expect others to follow my lead, but I wish they would. Of course I may have slipped up and used the wrong word somewhere – it is easy to do. And informal prose can become stilted and unpleasant to read if one is too pedantic; e.g. in some contexts “weasel” works much better than “an item specifier for the weasel clade” for communication with humans.

By the way I don’t want “item” to escape the context of this article. I just used it because I was uncomfortable with Gaurav’s “taxonomic unit” and I couldn’t think of anything better than “item”.

Categories: Uncategorized

How many described species are sampled in published phylogenetic trees?

You may have high expectations for this post! Don’t – answering this question in a serious way would take a lot of research. Instead I’m just scrawling a few notes here for reference. Maybe someone else can do the legwork.

I was asked this question and came up with a quick estimate of 10%. Here’s what went into this number (admittedly somewhat post hoc):

The Open Tree project’s phylesystem – a repository of curated phylogenetic trees – samples about 2% of the species that are listed in the taxonomy (OTT), which has 2.2 million tips. That would be about 45,000 species. I’m going from memory here; I did this measurement several years ago and could remember incorrectly. So you would have to double-check that. The number might be in one of the Open Tree papers, but it’s easy to compute by traversing the JSON files in phylesystem and checking them against the taxonomy (filtering by rank if you want to be picky).
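For anyone who wants to redo the measurement, the counting step might look roughly like the sketch below. It assumes the hard part is already done, namely that each study's JSON file has been reduced to a list of taxonomy ids for its tips and that the taxonomy has been filtered to a set of species ids. The function name `species_coverage` and the toy ids are invented for illustration and are not part of phylesystem or OTT.

```python
def species_coverage(study_tips, taxonomy_species):
    """Fraction of taxonomy species that appear as a tip in at least one study."""
    sampled = set()
    for tips in study_tips:
        # Ignore ids that are not species in the taxonomy (e.g. unmapped tips).
        sampled |= set(tips) & taxonomy_species
    return len(sampled) / len(taxonomy_species)


# Toy data: ten species in the taxonomy, two studies, one unmapped tip (999).
taxonomy_species = {101, 102, 103, 104, 105, 106, 107, 108, 109, 110}
study_tips = [[101, 102, 999], [102, 103]]

# Species 101, 102 and 103 are sampled: 3 of 10.
print(species_coverage(study_tips, taxonomy_species))
```

Taking the union of the studies' tip sets means a species sampled by several studies is counted only once.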

2.2 million could be high (due to species names with unusably bad descriptions, undetected synonymies, and so on) or low (due to published species not finding their way to any of the inputs to the taxonomy). I won’t touch it for now.

A lot of the studies that are in phylesystem are stored but not curated very well – so we may not have detected all the species that we have in a computationally useable way. Let’s raise the estimate from 2% to 3% to account for that.

Phylesystem has trees for maybe 6,000 published studies (5,000 to 8,000 – memory again). How much of the literature does this cover? I’m sure there are published estimates of the number of published phylogenetic studies but the references aren’t at my fingertips. We know phylesystem is quite incomplete (Drew et al), so 10,000 studies has to be a lower bound. 20,000 seems plausible assuming we’re doing some filtering (e.g. counting only studies that sample at least 3 described biological species – not linguistic phylogenies and so on). I think I’ve heard an estimate of 50,000 – I don’t remember where – and that seems plausible too.

With each additional study, there will be diminishing returns in that the species sampled may also be sampled by other studies. But let’s ignore this; it’s too much complexity for such a crude estimate.

In addition, the number of species per study is clearly not constant. Open Tree may have selectively chosen large or ‘high-yield’ studies for curation, leaving fewer species to occur in unselected studies. I will ignore this complication too.

So, assuming 3% coverage for every 6,000 published studies, and 20,000 published studies, that would be 10%.
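Spelled out as arithmetic, the estimate is a straight linear extrapolation. The snippet below just restates the numbers from this post; the function name `extrapolate` is made up for the example, and the linear scaling is exactly the simplification admitted above (no diminishing returns, constant species per study).

```python
def extrapolate(coverage, studies_sampled, studies_total):
    """Scale coverage linearly with the number of studies (a crude assumption)."""
    return coverage * studies_total / studies_sampled


coverage_in_phylesystem = 0.03  # 2% measured, raised to 3% for poorly curated studies
studies_in_phylesystem = 6_000  # from memory; could be 5,000 to 8,000

estimate = extrapolate(coverage_in_phylesystem, studies_in_phylesystem, 20_000)
print(f"{estimate:.0%}")

# The same rule applied to the other study counts mentioned above.
for total in (10_000, 50_000):
    print(total, f"{extrapolate(coverage_in_phylesystem, studies_in_phylesystem, total):.0%}")
```

Under the same assumptions, the 10,000-study lower bound would give 5% and the 50,000-study figure would give 25%, so the answer is only as good as the guess at the size of the literature.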

Categories: Uncategorized

She is an adjective

[Post updated 4/16/2019 in various ways to improve civility. Sorry Geoff.]

Comments are closed on your post, so I’ll reply here. You observed the following assertion:

Jane Jacobs has become more than a person. She is an adjective.

and said as follows:

I have absolutely no idea what the blurb-writer could have meant

A web search for “a Jane Jacobs” turns up plenty of examples of “Jane Jacobs” used as an adjective (or as HST points out as a noun in a noun-noun formation), so the evidence of what is intended is out there. Probably the above refers to “a Jane Jacobs walk” which is a kind of walk or stroll. For example, I find: “Anyone can host a Jane Jacobs Walk.”

“Jane Jacobs” is not written with quote marks in “Jane Jacobs … is an adjective”, and I am fully on board with the idea that a person cannot be an adjective. And I agree that using “Jane Jacobs” in a noun-noun formation doesn’t make it an adjective. But it is common in situations like the above to bend the rules for the sake of levity. I don’t like these practices any more than you do, but they happen and are not so hard to figure out.

Categories: Uncategorized

Apple confuses deputy

The MacOS API has something called an NSURLSession object, which relies on a background process (daemon) called ‘nsurlsessiond’ (if I understand correctly). If an application wants to fetch a web page, it uses an NSURLSession object, which pokes the daemon, which pokes the site, which returns the bits to the daemon, which returns the bits to the application.

Because I’m curious and a bit paranoid, I use a program called Little Snitch, which monitors all network requests made by applications. Little Snitch implements a conventional access control system: you can grant or deny any application access to network hosts, ports, etc. If you don’t want, say, gamed making any outward network connection, you can make a Little Snitch rule that prevents that.

So what if a program uses nsurlsessiond to make a connection? Little Snitch only knows that nsurlsessiond is asking for network access, so there is no way for it to grant different permissions to the different programs that are _using_ nsurlsessiond. I can’t ask Little Snitch to let, say, icloud make connections, but not to let gamed make connections, because Little Snitch only knows from nsurlsessiond, not gamed or icloud. So I have to allow neither or both.

This is a classic confused deputy scenario. I have to let nsurlsessiond connections through, or I can’t get work done. But when I do so, some evil principals will gain access I didn’t want them to get.
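A toy model makes the laundering visible. Nothing here is Little Snitch's or macOS's real interface: the `Firewall` class and the two `connect_*` functions are invented for illustration, and the only point is that the access decision is keyed on whichever process opens the connection.

```python
class Firewall:
    """Access control keyed on the process that opens the connection."""

    def __init__(self, denied):
        self.denied = set(denied)

    def allows(self, process):
        return process not in self.denied


def connect_directly(firewall, app):
    # The firewall sees the application itself.
    return firewall.allows(app)


def connect_via_daemon(firewall, app, daemon="nsurlsessiond"):
    # The firewall sees only the daemon; `app` never reaches the decision.
    return firewall.allows(daemon)


fw = Firewall(denied={"gamed"})
print(connect_directly(fw, "gamed"))     # False: the rule applies
print(connect_via_daemon(fw, "gamed"))   # True: the rule is bypassed
print(connect_via_daemon(fw, "icloud"))  # True

# Denying the daemon instead blocks every client, wanted or not.
strict = Firewall(denied={"nsurlsessiond"})
print(connect_via_daemon(strict, "icloud"))  # False
```

The `app` argument to `connect_via_daemon` is deliberately unused, which is the whole problem: the identity of the requesting program is dropped before the permission check happens.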

I never noticed nsurlsessiond before a recent upgrade. It’s not a bad programming pattern; a reusable piece of code is packaged up for general use. In fact deputies (abstractions, services) of this kind are generally considered good practice. The evil comes when use of a deputy ‘launders’ access so that information needed for permission-granting is lost. This particular deputy has a daemon, I guess so that requests from many programs can be coordinated, with the side effect that requests are ‘laundered’.

It may sound like I’m suggesting that the whole authority chain should be preserved through to the point where an access decision is made, so that Little Snitch can see it, but in general that doesn’t work either. There are ways to do this right (capability architecture), but they usually require an overhaul of the code, and a rewrite of the operating system…

Categories: Uncategorized

Oldest US entomological societies?

To: (contact address on entsocpa.org site)

The Entomological Society of Pennsylvania was founded in 1842, yet the American Entomological Society says that the AES, founded 1859, is the “oldest continuously-operating entomological society in the Western Hemisphere.”

What is the explanation of the difference? Did Ent. Soc. of PA suspend operations at some point, or is AES wrong?

There seems to be a lot of confusion about early American entomological societies. A recent article at the Biodiversity Heritage Library blog says that the AES is the oldest entomological society in the U.S., without qualification. This is clearly wrong. (They also neglect the Cambridge Entomological Club, “operating continuously” since 1874, in claiming that NYES was perhaps number three.) I intend to send them a correction.

(I too was guilty at one point of giving out incorrect information about early entomological societies. I’m not sure even now that I know what the first four were.)

Best
Jonathan Rees
former treasurer of the Cambridge Entomological Club

procesa1957.pdf
“The first truly entomological society in America was the Entomological Society of Pennsylvania formed in 1842”

http://darwin.ansp.org/hosted/aes/about.htm
“The American Entomological Society is the oldest continuously-operating entomological society in the Western Hemisphere, founded on March 1, 1859.”

http://blog.biodiversitylibrary.org/2016/09/the-new-york-entomological-society.html
“Depending on how you count, the New York Entomological Society (NYES), founded in 1892, is either the second or third oldest entomological society in the U.S. The oldest is the American Entomological Society, founded in 1859 in Philadelphia; the Brooklyn Entomological Society was founded in 1872, but merged with the NYES in 1968.”

http://entsocwash.org/default.asp?Action=Show_SocietyInfo&ID=History
“Entomological societies which preceded ours and which have continued to publish regularly are: 1) The American Entomological Society, 1867, …”

[Update 2016-09-29: The ESP got back to me with the following: “The ESP was founded in 1842, but fizzled in 1844. After a brief 80 year hiatus, it was started back up again in 1924.”]

[Update 2019-04-16: BHL updated its article to reflect this correction.]

Categories: Uncategorized

Digital preservation and independently held copies

[2016-01-25 Title updated to be more specific.]

Some information – writing, data, images – is important enough that it should be preserved and made available for as long as possible. Somebody, 5 or 10 or 50 or 200 years from now, might want or need to look at it. If you care that something be preserved, you will ask yourself what you can do to help bring about preservation.

It’s very easy for an individual, a project, or an organization to say: I am in control of this information, I am a responsible member of the community, and I can be a good steward. I will use the best redundancy technology and keep good backups, so the stuff will be safe from fire, natural disaster, and so on. It will be preserved because I will preserve it. (See e.g. NARA’s codification of being responsible.)

This may be true, up to a point, but it is a delusion. The risk that an individual, project, or organization might suddenly lose its ability to preserve is too great, in my opinion, for this to be an acceptable digital preservation solution by itself. Individuals die or become disabled; projects get canceled by management under budget pressure or changes in priorities; and organizations close or go bankrupt. And everyone is vulnerable to legal and governmental takedowns and censorship, and acts of war. These are all very unlikely events, but over long periods of time, unlikely risks become somewhat likely.

Every preservation plan must therefore include distribution of the information to one or more independent parties that are very likely to survive threats against the original steward. The receiving parties should be organizationally and legally independent of the original steward, and should reside in a different jurisdiction (country). They should keep their copy because they want to, not because they are being paid to.

Someone who gets one of these copies should be ready, if necessary, to make it available for use and perhaps further dissemination and preservation planning.

This is true whether we’re talking about Very Important Stuff handled by big well-funded entities, or stuff that’s extremely informal and small-scale. If it’s useful in your community, make sure a friend in another country has a copy.

Oddly, this problem used to be solved, but is now unsolved. During the print era, the natural and economical way to disseminate information was to make lots of copies and get libraries to take them up. Redundancy was a completely natural side effect of copying technology and economics. The Internet works in a completely different way: copies are made on demand (copied from the server to the client) and thrown away. There are content distribution networks (CDNs), but these are ephemeral and dependent (under contractual control of the original steward). We no longer have independent stewards of copies of things because we don’t need them to support our day-to-day habits.

(If the stuff in question is an active database, the recipient may also choose to continue updating it, or give it to someone else for update coordination, but this is an optional and orthogonal secondary step. The main point is that the information should be preserved, because someone might need to know what it says.)

If the “backup” is to become the new principal steward – and one should always be prepared for this – it will be important to transfer domain names as well. If the original steward is incapacitated, then the backup organization will have to change the DNS records without coordination with the original. That means prior transmission of registrar passwords. Arrangements like these are complicated and fragile, and therefore much rarer than they need to be. An excellent example of organizations doing the right thing in this regard is the coordination between FOAF and Dublin Core.

I was telling this story around 2007 to anyone who would listen, as part of my work for Science Commons. One of the most important infrastructure databases for scholarship is the Crossref DOI metadata – the information that gives you basic bibliographic information for the publication associated with a DOI. At the time I didn’t know whether Crossref was copying its database to an independent foreign partner, and maybe it wasn’t, but by 2010 Crossref had announced backup to Portico, which sounds pretty good to me – Crossref is a UK organization, Portico is a US organization, and neither would be made vulnerable by the other’s legal or financial trouble. The fact that Crossref issued a press release about this tells me that the idea of independent copies is neither obvious nor silly.

Twitter is not a very good way to carry on a conversation, but it has the advantage of being public, which helps keep people honest and responsive. ORCID is a fairly new organization that has an infrastructure database similar to Crossref’s, one that is starting to gain an important role in scholarship. On 18 January I casually asked:

wondering, does @orcid do outside-org outside-country backups like @crossref does (http://www.crossref.org/01company/pr/news111610.html …)?

The answer from @ORCID_org:

@jar346 @orcid @crossref Yes, we have backup servers in countries outside of US.

This didn’t answer my question; to me a “backup server” is something administered by the originating organization, perhaps physically residing in a different locale but not necessarily accessible to any “outside-org” there. And I found nothing on their site to reassure me. Rather than continue on Twitter I wrote this post. Maybe they will read it and get a better idea of what I was trying to say.

Don’t get me started on copyright.

Categories: Uncategorized

My Quora experiment

Generally I stay away from Quora because of all the inanity there, but I keep going back because there’s just enough good stuff (e.g. Keith Winstein keeps posting there).

After reading one of Philip Greenspun’s blog posts (I forget which one) I got to thinking about public education. Two peculiar things about it are (a) we pay to send other people’s children to school, even though education seems to be a private benefit (certainly college is considered to be one), and (b) we make it illegal for a parent not to send a child to school. (a) is simply liberal, once you see that education is a public good, not a private one, so not really more puzzling than public investment in roads. But (b) requires some justification, since on the surface it sounds like meddling in personal liberty, and also unnecessary: isn’t education in one’s self-interest?

I did some web searches around (b) and didn’t turn up very much. Mostly discussions of public education go to (a), talking about all the benefits of an educated public, and don’t address (b). The best reason I found was that forcing parents to send their kids to school protects the children since it keeps the children from being exploited for their labor in factories, on farms, and so on. Self-interest is not a good evaluation heuristic here because the parents’ interest may be at odds with the child’s interest.

There was also something about integrating the children of immigrants.

Maybe there is so little dissent from compulsory education that nobody questions it. You don’t see picket lines with people shouting “no more education”. As Philip would say, parents like the free day care.

My pet rationale for compulsory education is that it is defensive: children grow up to be voters and jurors, and when we are falsely accused we don’t want to be judged by the ignorant. We have to coerce people to be less ignorant, since otherwise they would choose to be ignorant. That is just a theory. Maybe school helps enlighten students, but public opinion polls would suggest it’s not very successful at it.

I want to emphasize that I’m not being polemical; I’m not asking the question because I have an axe to grind about how children ought to be free to skip school and parents have no responsibility if they do. I’m just looking for an answer to what I thought was an obvious question of political philosophy, and a relatively uncontroversial one given that you don’t hear a lot of fighting about it.

From time to time you do hear people complain about paying property taxes when they don’t have children or when their own children don’t benefit from the local public schools, and it would be nice to have a sensible answer to such complaints.

So I tried Quora. The way I asked was: “Why do we require, and pay, other people to send their children to school?” There were three serious flaws with this way of asking, and as a result the exercise was unproductive.

First, it is two questions; requiring and paying are very different things, as I say above, and they have different rationales. Most of the answers addressed the ‘paying’ part while completely ignoring the ‘requiring’ part. I’ve found something similar with email: if you ask two questions in an email message, the response(s) you get back will invariably answer one or the other but not both. If you have two questions, send two messages.

Second, it does not make clear that I was looking for a rationale for requirement that would decisively overcome the liberty argument.

Third, it talks about sending children to school, when it should be asking about compulsory education – home schooling is perfectly OK. So I got an answer picking at this flaw in the question, without giving me any response to the ‘requiring’ part.

I hope my report of these missteps will be of help to someone else formulating a question for Quora or any similar forum. A better question would have been: “What gives us a moral right to tell others that their children have to get an education?” – that actually helps generate hypotheses, such as uneducated = dangerous (making it similar to the imposition of building codes).

What useful information did I get? Here are excerpts (I am quoting people out of context, go back to quora.com for justice to them):

“The more educated people there are in your world, the larger your pool of potential good friends will be and the more interesting your life will be.” – this goes to (a), not (b).

“Because the collective cost of ignorance to society is far, far more expensive.” – this says why you would want to require education, not why you would have a right to do so.

“Requiring kids to go to school isn’t the only way kids can be educated.” – as I described above, the purpose of this response was to (justly) put in a plug for home schooling.

All the other answers were about benefit to society and why the public pays for education. No quarrel there. One responder taught me the term “merit good”, which was nice.

Did I learn anything? Yes, about how not to post questions to Quora, but not about the question at hand.

[addendum 2016-06-07: This Language Log post contains the kind of information I was looking for: James Garfield in his 1881 inaugural address said “All the constitutional power of the nation and of the States and all the volunteer forces of the people should be surrendered to meet this danger by the savory influence of universal education.” referring to the danger that illiteracy poses to the survival of the republic. That is, he says it’s not just a Good Thing, it’s a matter of addressing an existential threat, and therefore necessary.]

Categories: Uncategorized