Apple confuses deputy

The MacOS API has something called an NSURLSession object, which relies on a background process (daemon) called ‘nsurlsessiond’ (if I understand correctly). If an application wants to fetch a web page, it uses an NSURLSession object, which pokes the daemon, which pokes the site, which returns the bits to the daemon, which returns the bits to the application.

Because I’m curious and a bit paranoid, I use a program called Little Snitch, which monitors all network requests made by applications. Little Snitch implements a conventional access control system: you can grant or deny any application access to network hosts, ports, etc. If you don’t want, say, gamed making any outward network connection, you can make a Little Snitch rule that prevents that.

So what if a program uses nsurlsessiond to make a connection? Little Snitch only knows that nsurlsessiond is asking for network access, so there is no way for it to grant different permissions to the different programs that are _using_ nsurlsessiond. I can’t ask Little Snitch to let, say, icloud make connections, but not to let gamed make connections, because Little Snitch only knows from nsurlsessiond, not gamed or icloud. So I have to allow neither or both.

This is a classic confused deputy scenario. I have to let nsurlsessiond connections through, or I can’t get work done. But when I do so, some evil principals will gain access I didn’t want them to get.
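
To make the laundering concrete, here is a minimal sketch in Python. It is a toy model, not how Little Snitch or nsurlsessiond actually work: the `Firewall` class, the `Connection` record, and `fetch_via_daemon` are all names invented for illustration. The one assumption that matters is that the rule check sees only the process that opens the connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    process: str  # the process that actually opens the socket
    host: str


class Firewall:
    """Toy per-process allow list (hypothetical, for illustration only)."""

    def __init__(self, allowed):
        self.allowed = set(allowed)

    def permits(self, conn):
        # The decision can only use what is visible on the connection.
        return conn.process in self.allowed


def fetch_via_daemon(requesting_app, host):
    # The daemon opens the socket on the app's behalf, so the identity
    # of requesting_app never reaches the firewall.
    return Connection(process="nsurlsessiond", host=host)


firewall = Firewall(allowed={"nsurlsessiond"})
for app in ("icloud", "gamed"):
    conn = fetch_via_daemon(app, "example.com")
    print(app, "permitted:", firewall.permits(conn))
```

Both applications come out permitted, and removing nsurlsessiond from the allow list would block both. No rule in this model can tell them apart.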

I never noticed nsurlsessiond before a recent upgrade. It’s not a bad programming pattern; a reusable piece of code is packaged up for general use. In fact, deputies (abstractions, services) of this kind are generally considered good practice. The evil comes when use of a deputy ‘launders’ access so that information needed for permission-granting is lost. This particular deputy has a daemon, I guess so that requests from many programs can be coordinated, with the side effect that requests are ‘laundered’.

It may sound like I’m suggesting that the whole authority chain should be preserved through to the point where an access decision is made, so that Little Snitch can see it, but in general that doesn’t work either. There are ways to do this right (capability architecture), but they usually require an overhaul of the code, and a rewrite of the operating system…

Categories: Uncategorized

Oldest US entomological societies?

To: (contact address on site)

The Entomological Society of Pennsylvania was founded in 1842, yet the American Entomological Society says that the AES, founded 1859, is the “oldest continuously-operating entomological society in the Western Hemisphere.”

What is the explanation of the difference? Did Ent. Soc. of PA suspend operations at some point, or is AES wrong?

There seems to be a lot of confusion about early American entomological societies. A recent article at the Biodiversity Heritage Library blog says that the AES is the oldest entomological society in the U.S., without qualification. This is clearly wrong. (They also neglect the Cambridge Entomological Club, “operating continuously” since 1874, in claiming that NYES was perhaps number three.) I intend to send them a correction.

(I too was guilty at one point of giving out incorrect information about early entomological societies. I’m not sure even now that I know what the first four were.)

Jonathan Rees
former treasurer of the Cambridge Entomological Club
“The first truly entomological society in America was the Entomological Society of Pennsylvania formed in 1842”
“The American Entomological Society is the oldest
continuously-operating entomological society in the Western Hemisphere, founded on March 1, 1859.”
“Depending on how you count, the New York Entomological Society (NYES), founded in 1892, is either the second or third oldest entomological society in the U.S. The oldest is the American Entomological Society, founded in 1859 in Philadelphia; the Brooklyn Entomological Society was founded in 1872, but merged with the NYES in 1968.”
“Entomological societies which preceded ours and which have continued to publish regularly are: 1) The American Entomological Society, 1867, …”

[Update 2016-09-29: The ESP got back to me with the following: “The ESP was founded in 1842, but fizzled in 1844. After a brief 80 year hiatus, it was started back up again in 1924.”]

Categories: Uncategorized

Digital preservation and independently held copies

[2016-01-25 Title updated to be more specific.]

Some information – writing, data, images – is important enough that it should be preserved and made available for as long as possible. Somebody, 5 or 10 or 50 or 200 years from now, might want or need to look at it. If you care that something be preserved, you will ask yourself what you can do to help bring about preservation.

It’s very easy for an individual, a project, or an organization to say: I am in control of this information, I am a responsible member of the community, and I can be a good steward. I will use the best redundancy technology and keep good backups, so the stuff will be safe from fire, natural disaster, and so on. It will be preserved because I will preserve it. (See e.g. NARA’s codification of being responsible.)

This may be true, up to a point, but it is a delusion. The risk that an individual, project, or organization might suddenly lose its ability to preserve is too great, in my opinion, for this to be an acceptable digital preservation solution by itself. Individuals die or become disabled; projects get canceled by management under budget pressure or changes in priorities; and organizations close or go bankrupt. And everyone is vulnerable to legal and governmental takedowns and censorship, and acts of war. These are all very unlikely events, but over long periods of time, unlikely risks become somewhat likely.
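
The arithmetic behind that last sentence can be sketched in a few lines of Python. The 1% annual risk is a made-up figure for illustration, and the calculation assumes the risk is the same every year and independent from year to year, which real risks are not.

```python
def prob_at_least_once(annual_risk, years):
    """Chance that an event with the given yearly probability
    happens at least once over the given number of years."""
    return 1.0 - (1.0 - annual_risk) ** years


# Illustrative only: a steward with a 1% chance per year of losing
# the ability to preserve.
for years in (5, 10, 50, 200):
    print(years, "years:", round(prob_at_least_once(0.01, years), 3))
```

Under these assumptions, a risk that is negligible in any one year comes to roughly two in five over 50 years, and is more likely than not over 200.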

Every preservation plan must therefore include distribution of the information to one or more independent parties that are very likely to survive threats against the original steward. The receiving parties should be organizationally and legally independent of the original steward, and should reside in a different jurisdiction (country). They should keep their copy because they want to, not because they are being paid to.

Someone who gets one of these copies should be ready, if necessary, to make it available for use and perhaps further dissemination and preservation planning.

This is true whether we’re talking about Very Important Stuff handled by big well-funded entities, or stuff that’s extremely informal and small-scale. If it’s useful in your community, make sure a friend in another country has a copy.

Oddly, this problem used to be solved, but is now unsolved. During the print era, the natural and economical way to disseminate information was to make lots of copies and get libraries to take them up. Redundancy was a completely natural side effect of copying technology and economics. The Internet works in a completely different way: copies are made on demand (copied from the server to the client) and thrown away. There are content distribution networks (CDNs), but these are ephemeral and dependent (under contractual control of the original steward). We no longer have independent stewards of copies of things because we don’t need them to support our day-to-day habits.

(If the stuff in question is an active database, the recipient may also choose to continue updating it, or give it to someone else for update coordination, but this is an optional and orthogonal secondary step. The main point is that the information should be preserved, because someone might need to know what it says.)

If the “backup” is to become the new principal steward – and one should always be prepared for this – it will be important to transfer domain names as well. If the original steward is incapacitated, then the backup organization will have to change the DNS records without coordination with the original. That means prior transmission of registrar passwords. Arrangements like these are complicated and fragile, and therefore much rarer than they need to be. An excellent example of organizations doing the right thing in this regard is the coordination between FOAF and Dublin Core.

I was telling this story around 2007 to anyone who would listen, as part of my work for Science Commons. One of the most important infrastructure databases for scholarship is the Crossref DOI metadata – the information that gives you basic bibliographic information for the publication associated with a DOI. At the time I didn’t know whether Crossref was copying its database to an independent foreign partner, and maybe it wasn’t, but by 2010 Crossref had announced backup to Portico, which sounds pretty good to me – Crossref is a UK organization, Portico is a US organization, and neither would be made vulnerable by the other’s legal or financial trouble. The fact that Crossref issued a press release about this tells me that the idea of independent copies is neither obvious nor silly.

Twitter is not a very good way to carry on a conversation, but it has the advantage of being public, which helps keep people honest and responsive. ORCID is a fairly new organization that has an infrastructure database similar to Crossref’s, one that is starting to gain an important role in scholarship. On 18 January I casually asked:

wondering, does @orcid do outside-org outside-country backups like @crossref does ( …)?

The answer from @ORCID_org:

@jar346 @orcid @crossref Yes, we have backup servers in countries outside of US.

This didn’t answer my question; to me a “backup server” is something administered by the originating organization, perhaps physically residing in a different locale but not necessarily accessible to any “outside-org” there. And I found nothing on their site to reassure me. Rather than continue on Twitter I wrote this post. Maybe they will read it and get a better idea of what I was trying to say.

Don’t get me started on copyright.

Categories: Uncategorized

My Quora experiment

Generally I stay away from Quora because of all the inanity there, but I keep going back because there’s just enough good stuff (e.g. Keith Winstein keeps posting there).

After reading one of Philip Greenspun’s blog posts (I forget which one) I got to thinking about public education. Two peculiar things about it are (a) we pay to send other people’s children to school, even though education seems to be a private benefit (certainly college is considered to be one), (b) we make it illegal for a parent not to. (a) is simply liberal, once you see that education is a public good, not a private one, so not really more puzzling than public investment in roads. But (b) requires some justification, since on the surface it sounds like meddling in personal liberty, and also unnecessary: isn’t education in one’s self-interest?

I did some web searches around (b) and didn’t turn up very much. Mostly discussions of public education go to (a), talking about all the benefits of an educated public, and don’t address (b). The best reason I found was that forcing parents to send their kids to school protects the children since it keeps the children from being exploited for their labor in factories, on farms, and so on. Self-interest is not a good evaluation heuristic here because the parents’ interest may be at odds with the child’s interest.

There was also something about integrating the children of immigrants.

Maybe there is so little dissent from compulsory education that nobody questions it. You don’t see picket lines with people shouting “no more education”. As Philip would say, parents like the free day care.

My pet rationale for compulsory education is that it is defensive: children grow up to be voters and jurors, and when we are falsely accused we don’t want to be judged by the ignorant. We have to coerce people to be less ignorant, since otherwise they would choose to be ignorant. That is just a theory. Maybe school helps enlighten students, but public opinion polls would suggest it’s not very successful at it.

I want to emphasize that I’m not being polemical; I’m not asking the question because I have an axe to grind about how children ought to be free to skip school and parents have no responsibility if they do. I’m just looking for an answer to what I thought was an obvious question of political philosophy, and a relatively uncontroversial one given that you don’t hear a lot of fighting about it.

From time to time you do hear people complain about paying property taxes when they don’t have children or when their own children don’t benefit from the local public schools, and it would be nice to have a sensible answer to such complaints.

So I tried Quora. The way I asked was: “Why do we require, and pay, other people to send their children to school?” There were three serious flaws with this way of asking, and as a result the exercise was unproductive.

First, it is two questions; requiring and paying are very different things, as I say above, and they have different rationales. Most of the answers addressed the ‘paying’ part while completely ignoring the ‘requiring’ part. I’ve found something similar with email: if you ask two questions in an email message, the response(s) you get back will invariably answer one or the other but not both. If you have two questions, send two messages.

Second, it does not make clear that I was looking for a rationale for requirement that would decisively overcome the liberty argument.

Third, it talks about sending children to school, when it should be asking about compulsory education – home schooling is perfectly OK. So I got an answer picking at this flaw in the question, without giving me any response to the ‘requiring’ part.

I hope my report of these missteps will be of help to someone else formulating a question for Quora or any similar forum. A better question would have been: “What gives us a moral right to tell others that their children have to get an education?” – that actually helps generate hypotheses, such as uneducated = dangerous (making it similar to the imposition of building codes).

What useful information did I get? Here are excerpts (I am quoting people out of context; go back to the original answers to do them justice):

“The more educated people there are in your world, the larger your pool of potential good friends will be and the more interesting your life will be.” – this goes to (a), not (b).

“Because the collective cost of ignorance to society is far, far more expensive.” – this says why you would want to require education, not why you would have a right to do so.

“Requiring kids to go to school isn’t the only way kids can be educated.” – as I described above, the purpose of this response was to (justly) put in a plug for home schooling.

All the other answers were about benefit to society and why the public pays for education. No quarrel there. One responder taught me the term “merit good”, which was nice.

Did I learn anything? Yes, about how not to post questions to Quora, but not about the question at hand.

[addendum 2016-06-07: This Language Log post contains the kind of information I was looking for: James Garfield in his 1881 inaugural address said “All the constitutional power of the nation and of the States and all the volunteer forces of the people should be surrendered to meet this danger by the savory influence of universal education.” referring to the danger that illiteracy poses to the survival of the republic. That is, he says it’s not just a Good Thing, it’s a matter of addressing an existential threat, and therefore necessary.]

Categories: Uncategorized

Why is Open Tree not publishing RDF?

Question raised on an Open Tree discussion group:

I’m wondering why you are not using RDF as the underlying graph data model and OWL annotations (and other existing ontologies) to create a semantic graph and therefore following the current best practices to build knowledge graphs.

Good question. Partly it’s that only one person on the project knows anything about RDF. But I think this is mainly a matter of cognitive space and time among the developers, and priorities. If we felt a need to do it given the goals that we have, we would probably do it. But we haven’t felt any need.

Converting to RDF and OWL is easy to do poorly (and perhaps adequately for many purposes). One of the first things I did on the project was to convert the taxonomy to Turtle so I could load it into a triple store. (I was on the RDF bandwagon for many years.) Anyone could do this; it’s a trivial script. Also, the NeXML format that we use subsumes RDFa Core, so it can be converted easily – in a sense we *do* publish RDF for the study database.
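
To give a sense of how small such a script can be, here is a sketch in Python. It is not the script used on the project. The input shape (rows of id, parent id, name), the `http://example.org/taxon/` namespace, and the choice of `rdfs:label` and `rdfs:subClassOf` are all assumptions made for illustration, and they are exactly the kind of quick modeling decision that makes the result easy to produce and of limited value.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping the characters
    that would otherwise end or break the literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def taxonomy_to_turtle(rows, base="http://example.org/taxon/"):
    """rows: iterable of (taxon_id, parent_id or None, name).
    base is a placeholder namespace, not a real one."""
    lines = [f"@prefix rdfs: <{RDFS}> .", ""]
    for taxon_id, parent_id, name in rows:
        subject = f"<{base}{taxon_id}>"
        lines.append(f"{subject} rdfs:label {turtle_literal(name)} .")
        if parent_id is not None:
            # Modeling parent/child as subClassOf is an assumption.
            lines.append(f"{subject} rdfs:subClassOf <{base}{parent_id}> .")
    return "\n".join(lines) + "\n"


print(taxonomy_to_turtle([(1, None, "life"), (2, 1, "Metazoa")]))
```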

Doing RDF/OWL well is much harder, and would require cooperation with other groups such as OBO (IAO, VTO, …) and TDWG, choice of and support for persistent URLs, good term definitions and documentation, SPARQL endpoint, and so on. These coordination activities are extremely time consuming. Of course doing so would be lovely in the abstract, but there has been no reason for us to make this a priority.

In my experience, format conversion is by far the easiest activity in data ecology, so mere conversion to RDF has little value. The hard parts are marshalling the data in the first place, and then using it wisely. Due to the vagueness of most vocabulary term definitions, the best laid RDF usually requires as much reverse engineering and postprocessing as data in any other format when doing data integration and analysis. So it is semantics, not syntax, where the effort is best spent. (RDF being a syntactic play, and not helping with semantics any better than any other data format, in spite of the buzzword “semantic web”. OWL helps semantics a little but only with inference, not with ground truth, which is what really matters.)

The feedback captured in the feedback system (in GitHub) has a little structure, and we could probably do better in obtaining more.

The thing that would tip the balance would be a real funded collaboration with another project where there was good reason to use RDF or OWL for communication between the collaborators. Publishing RDF/OWL merely for the sake of doing so is not in my opinion the best use of resources – especially given that all the information is open and anyone else could do such a conversion for us. I read a lot about the size of the linked data cloud, but very little about its utility. I bet there are legitimate uses of RDF-published data, but from what I’ve seen people mostly publish RDF just so that they can say that they did, not because they know that someone needs it. (Would love to be shown otherwise.)

How would having RDF for open tree make a difference to you, personally?

Categories: Uncategorized

Direction of tree growth

As someone with computer science degrees who is working on an evolutionary biology project, I have to be constantly vigilant about tree-growth direction confusions. Just now I found the following sentence in an article in Algorithmica:

For v, w nodes in T, we say that v lies below w if the path from v to the root of T passes through w.

Now real trees are oriented with their root(s) at the bottom, the trunk in the middle, then the branches, and the leaves (or needles) at the very top. If v is a leaf or branch, how can it lie below something that’s on the path from v to the root?
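
For reference, the quoted definition can be written down directly. This Python sketch assumes a tree stored as a child-to-parent map, with node names made up for the example, and reads ‘below’ strictly, so a node does not lie below itself. The quoted sentence does not settle that last point.

```python
def path_to_root(node, parent):
    """Nodes visited walking from node up to the root, inclusive."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def lies_below(v, w, parent):
    """v lies below w if the path from v to the root passes through w.
    Read strictly here (an assumption): v is not below itself."""
    return v != w and w in path_to_root(v, parent)


# Hypothetical three-node tree: root <- branch <- leaf
parent = {"root": None, "branch": "root", "leaf": "branch"}
print(lies_below("leaf", "root", parent))
print(lies_below("root", "leaf", parent))
```

By this definition the leaf lies ‘below’ the root, which is the orientation that makes no sense for a tree standing in a field.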

Maybe we should picture a hook-shaped or umbrella-shaped tree, with its trunk shooting up and all of its branches and leaves hanging down from the top of the trunk. There are trees like that, I think. Or, a hanging vine or epiphyte, growing downward from the spot where it’s planted. Then v could be below w with w on the path from v to the root. (Hmm, I don’t think an epiphyte would grow down; the whole point of their plant-on-tree adaptation is to obtain sunlight, which of course comes from above.)

Drawing trees sideways is a neutral solution to make life equally difficult for both cultures, and you see a lot of phylogenetic trees drawn this way in the literature.

The phylogenetics folks on the project speak of one node being ‘deeper’ than another. It took me a while to figure this out, but their usage is in agreement with real trees if you imagine them submerged, as you’d see in the forests near the mouth of the Amazon, the ones that have frugivorous fish. Of course this is contrary to the way ‘depth’ is used in computer science. When computer scientists talk about depth-first search, they mean to start at the root and go toward the leaves, with depth increasing along the way.
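
The computer science convention, for comparison, counts depth as the number of edges from the root, so the root has depth 0 and the tips are the ‘deepest’ nodes. A depth-first search begins at the root and works its way toward the leaves. The tree and node names in this Python sketch are invented for the example.

```python
def depth(node, parent):
    """Number of edges between node and the root (root has depth 0)."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def depth_first(root, children):
    """Preorder depth-first traversal, starting at the root."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order


# Hypothetical tree: root has children a and b; a has child a1.
parent = {"root": None, "a": "root", "b": "root", "a1": "a"}
children = {"root": ["a", "b"], "a": ["a1"]}

print(depth("a1", parent))
print(depth_first("root", children))
```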

How did trees get flipped upside down like this? I think it comes from sentence diagramming, where by convention all the trees are drawn upside down. I would guess the custom found its way from sentence diagramming to computer science via Chomsky, who was very influential in the early days of CS, probably more so than, say, Ernst Mayr (see figure here to see how he drew them).

Added 2015-07-30:

1. In lattice theory one speaks of lower and upper bounds, and top and bottom elements. One interpretation of a lattice is as a family of sets, and when this is done usually the bigger sets go toward the top and the smaller ones toward the bottom. This is reflected in the usual v-like symbol for least upper bound or “join”, which reminds me of the u-like set union symbol, and greatest lower bound or “meet”, which looks like intersection. (By duality you could treat everything the opposite and the theory would all still work.) If you think of taxa being set-like, this puts the small taxa at the bottom and the large ones at the top. This is the opposite of what the biologists would prefer.

People who work on the mathematics of phylogenetic trees often appeal to the theory of upper semilattices, which being a flavor of lattice theory puts the root of the tree at the top, so they will have at least as much disorientation risk as I do.
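
Here is a small Python sketch of the set-like reading, assuming taxa are represented as sets of tips and the family of sets comes from a tree (so any two clades are either nested or disjoint). The clade contents are invented. The join of two clades is the smallest clade containing both, and the biggest set, the root, ends up as the top element.

```python
def join(x, y, family):
    """Least upper bound of x and y within the family: the smallest
    member containing both. Assumes the family comes from a tree, so
    the candidates are nested and 'smallest' is well defined."""
    uppers = [c for c in family if x <= c and y <= c]
    return min(uppers, key=len)


# Hypothetical clades over tips a, b, c: ((a, b), c)
clades = [
    frozenset("a"),
    frozenset("b"),
    frozenset("c"),
    frozenset("ab"),
    frozenset("abc"),
]

top = max(clades, key=len)  # the root clade
print(sorted(join(frozenset("a"), frozenset("b"), clades)))
print(sorted(join(frozenset("a"), frozenset("c"), clades)))
print(sorted(top))
```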

2. In traditional taxonomies there is the notion of ‘higher’ and ‘lower’ taxonomic rank. The ‘higher’ ones, like kingdom, are the ones closer to the root of the taxonomic tree, and the ‘lower’ ones like genus are closer to the tips. This inverted orientation comes from applying a different metaphor, one incompatible with trees. The image this conjures for me is medieval power structures where the more powerful you are the higher your elevation. The higher you are, the better you can be heard (to command), the further you can see (for intelligence gathering), and the better positioned you are for waging war. So even within biology there is no consistency.

[Added 2016-03-12: good discussion of tree orientation on Tufte’s site; study on effect of tree layout on comprehension. Thanks Jim Allman!]

Categories: Uncategorized

When does x refer to y?

I have been concerned about the situation where a claim of the form ‘x refers to y’ is to be tested, perhaps because it is a requirement of a specification and one wants to see whether an engineered artifact (specifically a language-using agent) conforms to the specification. Claims of reference appear, on the surface, to require introspection, which is not generally something you do in an engineering context. What experiments or analysis do you perform (on an agent) to see whether the claim might hold, or not? Recognizing of course that in engineering, as in science, there is no proof, only absence of disproof.

Knowledge representation naysayers and semantic web pooh-poohers are in effect saying that talk of meaning and reference is not objective – it does not belong in science or engineering. I wonder if the failings of KR and semweb are not because they are inherently ill-founded, feeble, or intractable, but rather are due to inadequate understanding of meaning and reference, and consequent poor execution.

The question – how do you tell whether x refers to y? – was central to my puzzlement over W3C TAG issue httpRange-14 when I was involved with the TAG. Any answer to the question would seem to put a requirement on whether and how a URI refers.

I’ve argued here and in other posts (I repeat a lot) that it is possible to test claims of the form ‘s means p’ where s is a sentence and p is a proposition. This is because, in contrast to referring phrases (x above), there is an observable connection between the sentence being said, and certain states of affairs in the world. (Imperatives such as ‘complying with s leads to p’ work the same way.) Put briefly, s means p, if {s might be said} if and only if p.

I tried saying that x can refer to any y that has the property that every sentence of the form k∙x means the proposition p(y), where p is the meaning of the predicate phrase k. This is ugly and creates a circularity, since it would seem that assaying the meaning of x would require assaying the meaning of various p’s, which would require assaying the meaning of various x’s, etc. One might use this formulation to look for relative meaning of referring phrases and predicate phrases, but not for any independent statement of meaning of phrases (of the sort one can make for sentences). I acknowledge that relative meaning is more or less what model theory advances, but it seems counterintuitive to me. We argue about what a word means; we don’t seem to argue about what one word means relative to others.

(I write k∙x to denote the sentence composed from predicate phrase k and referring phrase x.)

What I recently noticed is that to test reference you don’t need to know what predicate phrases mean, only what sentences that contain them mean. I propose the following:

   x refers to y, if every sentence k∙x means a proposition that is about only y.

This proposal has a gazillion qualifiers.

  • ‘k∙x is about only y’ means that the truth of k∙x is affected only by (the state of) y; a change to something else that doesn’t affect y can’t change the truth of k∙x.

  • Not all sentences mean, so I’d want to change “every sentence k∙x” to “every meaningful sentence k∙x”. I left the word out to avoid clutter.

  • If a sentence has two referring phrases x and x′, then the proposition that the sentence means is ‘about only’ a combination of the two things that x and x′ refer to.

  • Sentences can mean propositions whose truth value is affected by variables not referenced in the sentence. ‘Grue’ is the classical example, but ‘highly rated’ is similar (it is not said who is doing the rating). As a patch I would say that the languages under analysis would have to forbid such predicates, or else would have to be translated into some second language lacking them.

  • It is possible that two distinct subjects / entities / referents change their state exactly in tandem, in which case looking for patterns of change would not be enough to tell them apart. One example might be the two propositions p and not p. I suspect there are others, but there are enough cases where a subject is adequately determined by its state space that I don’t consider this a fatal flaw.

  • The proposal may fail to uniquely ‘identify’ some intended y as the referent, in that applying all possible predicate phrases k to x could yield propositions all of which are about only some y′ that has ‘fewer’ states than y (i.e. the state space of y, considered as a partition of the world state space, might be a refinement of that of y′). That is, distinctions between certain states of y cannot be expressed in the language under consideration. If this is the case, ways out would include: to consider the language to be deficient; to consider y to be a pathological or disallowed subject; to take the proposal to be a definition of reference; or to argue that the distinction between y and y′ cannot make a difference to whether an agent meets any specification.

  • The proposal may also fail to uniquely determine y if candidate referents can differ in ways other than in what doesn’t matter to them, i.e. other than in how their state spaces partition the world state space. After Yablo, I find the idea that subjects (or subject matters) are iso-ontic with their world-state-space partitions to be appealing, and while there are a few things about it that I don’t completely get, I’m sticking with it for the time being.

  • Deciding whether any given change to the world constitutes a change to some given y is by no means a science. This would be a negotiation between what is meant (at the meta-level) by the world state space, and what is meant by y.

  • Indefinite reference will require additional machinery or handwaving.

  • To broaden applicability we can interpret ‘change’ (i.e., differences between points in the world-state-space) broadly: not just as change in the physical world through time, but ‘motion’ through any kind of state-like set, such as possible contents of a document, possible identities, possible worlds, and so on. Not that I suggest a free for all, but that I don’t want to lose the framework on account of it appearing to be too narrow or rigid.

  • Obviously all the richness of human language is being put aside.
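
To see what a test along these lines might look like, here is a toy model in Python. Everything in it is an assumption made for illustration: the world is a handful of named variables, a subject y is a set of those variables, the meaning of a sentence k∙x is given directly as a function from world states to truth values, and only a finite list of sentences is checked rather than every one.

```python
from itertools import product

# Hypothetical world: two variables, each with two possible values.
VARIABLES = {"door": ("open", "closed"), "lamp": ("on", "off")}


def world_states():
    """Enumerate every point in the toy world state space."""
    names = sorted(VARIABLES)
    for values in product(*(VARIABLES[n] for n in names)):
        yield dict(zip(names, values))


def about_only(proposition, subject):
    """True if the proposition's truth value is the same in any two
    world states that agree on the subject's variables."""
    seen = {}
    for state in world_states():
        key = tuple(state[name] for name in sorted(subject))
        truth = proposition(state)
        if seen.setdefault(key, truth) != truth:
            return False
    return True


def refers_to(sentence_meanings, subject):
    """x refers to subject if every sentence built on x means a
    proposition that is about only subject."""
    return all(about_only(p, subject) for p in sentence_meanings)


# Meanings of two sentences built on the phrase 'the door'.
door_sentences = [
    lambda s: s["door"] == "open",
    lambda s: s["door"] == "closed",
]

print(refers_to(door_sentences, {"door"}))
print(refers_to(door_sentences, {"lamp"}))
print(refers_to(door_sentences, {"door", "lamp"}))
```

The last call also succeeds: the combined subject passes the test just as the door alone does, which is one way the uniqueness caveat above shows up even in a model this small.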

With apologies to Leibniz, Yablo, and the usual cast of characters (you know who you are).

More to come, I hope – this idea will require testing and elaboration.

Categories: Uncategorized