Another of my essays about the evolution of formal languages (programming languages, markup languages, etc.)
I still get confused by this stuff…
“I warn my aging male friends to guard against this effect, to avoid wallowing in cantankery, to stave off the inner curmudgeon. It is an emotional dynamic characterized by bitterness, resentment, a lack of indulgence and a rigid adherence to authoritarian sentiments.” — Monsieur Chariot writing on open.salon.com. I confess the unstaved influence of my inner curmudgeon dates to before I was 20.
From 2002 to 2005 I kept a blog, actually just a file with no RSS feed. Not exactly gripping reading, but then I never promised you that. Some of it, like the Alan Kay links, looks pretty interesting and I should review it at some point.
In case you look at the page, and wonder what became of some of these activities: I did have a good time in Xi’an, more or less; photos are here. I kept a private blog during those four months and may transfer some of those entries to this one, in my abundant spare time. The MCZ decided not to support Psyche and I decided I couldn’t devote the time that would have been required to run it, so I brokered its transfer to Hindawi Publishing, where it is flourishing, with nine completed articles so far, and three in preparation. My sincere thanks to Paul Peters, Jim Traniello, and others who have been responsible for the revival of this venerable institution. I did have it all scanned and OCR’d. Next job would be to do entity extraction on it – it’s chock full of species and place names – and link to one or more of the biodiversity catalogs.
The entries about spam make me wonder exactly when the problem became so bad that automated filters were needed, and how rapid the transition from the spamless world to the spamful world was. In December 2003 I was using a filter but was still looking at the positives, which means it must have been pretty new at the time. Wikipedia disappoints here: its history starts in 2004.
OK, Tim D is giving me a hard time about this, so I need to talk about it a bit.
I think terminology is pretty important and tend to spend a lot of time thinking about it and talking about it. One might take the Humpty Dumpty position that a word can be redefined in any way one likes, and if your clout level is either very high or very circumscribed you can get away with it. For example, mathematicians redefine common words all the time as terms of art with meanings ridiculously detached from ordinary usage: group, field, ring, catastrophe, category, object, arrow, complex, matrix, and so on. They get away with it because context is rarely lost (you know when you’re doing mathematics and when you’re not), and also because they’re a strong force: when they deliver value, which they do, they earn the right to change the meanings of these words.
Other acceptable cases include when there is clear scope (as when Don Knuth’s book Surreal Numbers redefines “number”); humor, absurdity or affection; some kind of marker such as capitalization (such as “BOA” in Common Lisp); or a usage that is so remote from ordinary use that there can be no confusion (“BOA” is also an example of this).
I am also not too worried when a definition as a term of art is a subset of common usage, or if the stretch in meaning is not too much (such as “record” in a database).
But when the term is given a meaning that overlaps common use you’re asking for trouble. The reason is the danger that the use can get detached from the context in which the term is defined as a term of art. This can happen as a result of a copy/paste, or someone entering in the middle of a conversation, or someone just forgetting the definition, or even purely unconscious forces that introduce bias and false intuitions.
I want to gripe first about “identifier” (which is currently being discussed on the IAO list). This has been corrupted by the computing and web folks to be almost meaningless. The corruption is even reflected in the Wikipedia article. Wordnet gets closer to a natural definition: “a symbol that establishes the identity of the one bearing it”. My preference is to restrict use to cases where a mark is borne by the thing that’s supposed to be identified, and where the mark can actually serve an identification purpose. Basically an identifier is any marking or other property that can be used to decide whether the thing at hand is the same as, or different from, a thing seen at another time or by someone else – i.e. to discriminate between a state of affairs in which something is seen twice and one in which two things are seen. Good examples are [UPCs (written on labels on a product), ISBNs (similarly), – see comments!] unique keys in database records, RFIDs, fingerprints, scars, etc.
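To make the "seen twice versus two things seen" test concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the Sighting record, the same_thing function, and the RFID-style mark values are assumptions of mine, not anyone's real system. It also assumes that marks are unique among the things in play, which is exactly the property that lets a mark do identification work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One observation of a thing, recording the mark it was seen to bear."""
    observer: str
    mark: str  # hypothetical mark, e.g. an RFID value or a database key


def same_thing(a: Sighting, b: Sighting) -> bool:
    """Decide between 'one thing seen twice' and 'two things seen'.

    Assumes marks are unique among the things under consideration.
    """
    return a.mark == b.mark


first = Sighting(observer="alice", mark="rfid:0451")
second = Sighting(observer="bob", mark="rfid:0451")
third = Sighting(observer="bob", mark="rfid:0999")

print(same_thing(first, second))  # True: one thing, seen by two people
print(same_thing(first, third))   # False: two different things
```

The point of the sketch is only that the mark travels with the thing and is what the comparison consults; a string that nothing bears could not play this role.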
Compare to “identifier” in the Scheme reports: I found no direct definition, but we have “let, even though it is an identifier, is not a variable, but is instead a …” and similar examples. Apparently the usage was picked up from usage in other similar documents in the computing literature. “Identifier” is used as a syntactic category in the language, not with reference to any particular role in identification processes, which is sort of like defining “doctor” to be any person.
(I’m not criticizing any editor of the report, as we were all complicit in this.)
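Scheme is hardly alone in this. As a parallel, and only a parallel, Python's standard library uses "identifier" the same way: str.isidentifier() answers a purely lexical question, and keyword.iskeyword() separately reports whether the string is reserved. The classify helper below is my own name for the illustration, not part of any library.

```python
import keyword


def classify(s: str) -> tuple[bool, bool]:
    """Return (is lexically an identifier, is a reserved keyword)."""
    return (s.isidentifier(), keyword.iskeyword(s))


for s in ["let", "if", "x1", "1x"]:
    print(s, classify(s))
```

Here "if" comes out as an identifier and a keyword at once, much as "let" in the Scheme reports is an identifier without being a variable. Neither check asks whether the string identifies anything.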
Another example is “identifier” in RFC 3986. If one makes the reasonable assumption that a URI is a kind of identifier (that’s what the “I” stands for) one gets into all kinds of trouble. tag:email@example.com,2009:fdsa is a URI (i.e. syntactically satisfies the spec), but as it only occurs in this blog post, without explanation, what on earth would it identify, and how would it do so if it did?
Even if we allow that a URI is only a potential identifier – that is, a string designed to participate in some system of identification, as opposed to one that actually does participate – the case for identifierness is tenuous. In what sense could http://google.com/ be an identifier? It might be useful to some web server in identifying one of the entities it has on hand as the one that a client is talking about. But it’s more likely that the server just has a table, or some other process, that matches request URIs (strings) with a source of information (“data object or service” per RFC 2616) that itself does not bear an identifying mark. Of course the URI is of no use to a client in identifying anything. A better term for the role the string plays on the client side might be name, locator, or designator.
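The "server just has a table" picture might look something like the following sketch. The ROUTES table, the handle function, and the two page functions are all hypothetical; no real web server or framework is being described.

```python
def home_page() -> str:
    """A source of information; nothing in it records a URI."""
    return "<html>search box</html>"


def about_page() -> str:
    return "<html>about</html>"


# Hypothetical table matching request-URI paths (strings) to sources.
ROUTES = {
    "/": home_page,
    "/about": about_page,
}


def handle(path: str) -> str:
    """Look the string up in the table and run whatever it maps to."""
    source = ROUTES.get(path)
    if source is None:
        return "404 Not Found"
    return source()


print(handle("/"))
print(handle("/missing"))
```

The strings live in the table, not on the things the table points to. The same function could be registered under three different paths, or none, without changing at all, which is why "name" or "designator" seems a better fit than "identifier" for what the string is doing.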
At best, you’d have to distinguish a true identifier from a potential identifier, just as one would distinguish an actual word (“frog”) from a potential word (“gorf”) or an actual name (“Pat Hayes”) from a potential name (“Zacharias Mbutu” – if someone in fact has this name I apologize and will fix the example!). But consider the Scheme case of “let”. This string certainly plays an important role in the language, but the role it plays is not one of “identifying” a syntactic form (with associated semantics) in the sense of telling that form apart from other forms that it might be confused with – it merely designates or names the form. Perhaps a Scheme interpreter might have a collection of things, each marked with a string, and it might use the string “let” attached to something to identify that thing as the one that should be used internally to interpret an expression; but this is certainly not demanded of an implementation, and is not what the programmer has in mind. To the programmer the string “let” has usage, meaning, etc. but is not used for identifying anything.
(Apologies to Pat Hayes, the originator of this argument.)
Gotta run; I’ll rail against “resource” soon, and maybe “variable”.
At TDWG 2009, which ended about two hours ago, I learned of Lincoln Stein’s 2002 paper Creating a bioinformatics nation. It’s not open access so I haven’t read it yet, but one can get the gist here and here. The talk/paper was apparently influential (to those better clued in than me), and I’ve added it to the Neurocommons reading list. The main message is: Don’t make us screen-scrape; provide machine-friendly access to your stuff.
I’m not too keen on web services with fancy APIs; REST-based query interfaces, bulk downloads, and SPARQL seem to be easier to use. But the idea is right.
Has there been any progress since 2002?