Archive for November, 2010

To what does a URI refer?

(Trying out a new theory here, apologies if I change my mind next week.)

Suppose someone sends you email (or a text message) that says:

“I really enjoyed reading http://en.wikipedia.org/wiki/Communist_manifesto – you should take a look!”

What piece of writing do you think they’re talking about? One written in the 21st century, or one written in the 19th century?

Because URIs were invented for the web, general practice has been to use a URI to refer to the thing that is on the web at that URI, if there is such a thing. That would be the same thing we observe when we “browse” to that web page. This holds categorically, regardless of what that web page says.

The categorical nature of this convention is important because computers are stupid. They are our assistants in interpreting the web, and can’t be expected to pick up on indirect cues that are supposed to inform the interpretation of a URI.

If we don’t observe any content at that URI – that is, if we get a non-success response (success being a “2xx” status in the HTTP protocol) – then all bets are off and some other way of using the URI might apply.
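
To make the convention concrete, here is a minimal sketch in Python of the rule a (stupid) computer could apply. The function names are my own invention for illustration, and the status code is assumed to come from an HTTP GET on the URI performed elsewhere.

```python
def observed_content(status_code: int) -> bool:
    """True when a GET on the URI returned an HTTP 2xx (success) status."""
    return 200 <= status_code <= 299


def interpret(uri: str, status_code: int) -> str:
    """Apply the convention categorically, without looking at what the page says."""
    if observed_content(status_code):
        return f"<{uri}> refers to the web page observed at that URI"
    # Non-success response: all bets are off, some other usage might apply.
    return f"<{uri}>: no page observed, so some other way of using the URI might apply"
```

Note that the rule never inspects the body of the response; that is what makes it categorical.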

From time to time it is proposed to change the way we use these URIs (the ones that are associated, on the web, with web pages) to refer to things other than the web pages with which they’re associated. Attempting to disrupt an established practice is probably a bad idea. If there’s some problem for which redefining the way we use URIs seems to be the solution, we should look very carefully at other options before launching a campaign to convince everyone who uses URIs to do something new.

You can stop reading here if what you’ve read so far makes sense.

Objections

Web pages aren’t real or are ill-conceived? – It’s hard to point at anything in particular and identify it as being a web page. A file in a computer? Sometimes. The output of a content management system under certain circumstances? Sometimes. The stuff we might get when we GET? Maybe. A book or article or song? One would think. Based on the way we talk about them, web pages seem best described as regularities that might be observed at a fixed “place” (such as “at” a URI on the web). One page might be regularly observed to have the same author and declarative content, while another (a group blog or news site, for example) might have frequently changing author and content. In either case the page is characterized by its lawful variation or lack thereof. Web pages may be one or two degrees abstracted from reality, but they are as real as many things we usefully talk about.

We can use one URI to refer to different things depending on context? – You can try, but there are some problems with trying to revoke the “U” in “URI”. One is that computers and programmers get things wrong, and the likelihood that the context might be misread makes for a fragile system. Another is that context sensitivity is a threat to polymorphism (functions that are generic across domains) and to interoperability (combining functions across domains).

And don’t get me started on what the specs say.

We can say a URI refers to an ontological chimera of a web page and something else? – That is, http://dbpedia.org/page/Paris might simultaneously have both authors and a population? The authors of the page, that is, and the population of the city. That sounds confusing: what kind of thing would you say it is, a pagity? Worse, what do you do when the “something else” is itself similar to a web page, as in the example at the top, and there are statements (author, subject matter, date, etc.) that are true of the snake and false of the goat?

Regularities are unknowable? – If a web page is defined by its regularities, and regularities are universally quantified statements (e.g. the author of this page is always Jonathan Rees), and universal propositions are unknowable (cf. Popper), then isn’t the “web page” idea useless? Well, regularities of physically realized web pages are in the realm of hypotheses about the future that can’t be proven, only falsified. The answer to that is that we make statements like this about reality all the time, and do useful inferences with them, subject to whatever epistemological framework we choose to work with. We try not to be wrong, but if we are, we fix our mistakes.

How do regularities come to be known (or hypothesized)? – Maybe someone tells you (like the individual who deployed the page, or a page that links to the page), and you find them credible. Or, you make observations (look at the page a few times) and form an idea. Or maybe it resembles some other web page you know a lot about, and you’re willing to make assumptions based on that. Or maybe you control the page yourself and can cause it to have whatever properties you like.

We don’t really use URIs to refer, only to locate? – That is, the example at the top might be expected to lead to confusion, and to be clear we really ought to say “the article at (URI)” or “the manifesto described at (URI)”, never (URI) without “at”. Maybe so, but retracting the use of URIs for reference would be a major redesign of the way we talk about the web (at least in IETF and W3C), and would obsolete most if not all RDF content.

But we often use URIs to refer to things that aren’t web pages? – Yes, but URIs that don’t refer to web pages aren’t made to look like them. Thus the practice of using HTTP 303, or “hash URIs” without a corresponding document fragment, as ways to avoid such an appearance. These techniques are kludges, but historically they made the ‘semantic web’ possible, since they didn’t rely on deployment of a new protocol or a revolution in the way URIs are used.
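
A rough sketch of how a client might tell these cases apart, in Python. The function name and the example.org URIs are hypothetical, the status code is assumed to come from a GET on the fragment-less URI, and as a simplification the sketch treats any fragment as a hash URI without checking whether the retrieved document actually defines that fragment.

```python
from urllib.parse import urldefrag


def classify(uri: str, status_code: int) -> str:
    """Guess how a URI is being used, given the status of a GET on its fragment-less form."""
    base, fragment = urldefrag(uri)
    if fragment:
        # Simplification: any fragment counts, whether or not the document defines it.
        return f"hash URI: may refer to something described in <{base}>"
    if status_code == 303:
        return "303 See Other: may refer to something other than a web page"
    if 200 <= status_code <= 299:
        return "web page"
    return "undetermined"
```

For example, `classify("http://example.org/people#alice", 200)` falls in the first branch, while `classify("http://example.org/id/paris", 303)` falls in the second.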

I can’t teach my server to do 303 or fragment ids? – As these are the only ways at present to associate URIs with things other than web pages in a way that’s easily discoverable, you’ll need to either contract with someone else to deploy your URIs, forgo discoverability (i.e. use RDF with unresolvable URIs – this is unpleasant but not technically wrong), or team up with others to convince the world to adopt a new discovery protocol that you can implement (such as /.well-known/host-meta). Working through the standards organizations may be the most effective way to do this.

However, I find it highly unlikely that neither discovery technique will work for you, as they’re not that tricky. Seek advice.
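
To give a sense of how little is involved in the 303 technique, here is a minimal sketch using Python’s standard http.server module. The paths, the DESCRIBED_BY table and the names of the handler and helper functions are all made up for illustration; a real deployment would more likely configure this in an existing web server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from "thing" paths to the documents that describe them.
DESCRIBED_BY = {"/id/paris": "/doc/paris"}


def redirect_target(path: str):
    """Return the describing document's path, or None if path is not a 'thing' URI."""
    return DESCRIBED_BY.get(path)


class SeeOtherHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is not None:
            # Not a web page: answer 303 See Other and point at the description.
            self.send_response(303)
            self.send_header("Location", target)
            self.end_headers()
        elif self.path in DESCRIBED_BY.values():
            # An ordinary web page: answer 200 with some content.
            body = b"A document describing something that is not a web page.\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def serve(port: int = 8000) -> None:
    """Run the sketch locally; call this yourself, it is not started on import."""
    HTTPServer(("127.0.0.1", port), SeeOtherHandler).serve_forever()
```

Calling `serve()` and then requesting /id/paris should produce a 303 whose Location header is /doc/paris, while /doc/paris itself answers with a 2xx and so counts as a web page.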

What “web page”, I thought they were “information resources”? – You get a more general theory if there is a class of things that includes both web pages and potential web pages: things that might not be on the web yet or ever, but could be. We might call these things “information resources”. The term “information resource” has a confusing history, and could easily be seen as incompatible with what I’m talking about, but if they sound the same to you that’s fine.
