How to apply a CC0 waiver to an ontology

OK, there are two issues, one being what statements (triples) are needed in order to assert the waiver, the other being where to put them.

If there is a “landing page” for the ontology then CC Rel by Example gives a good start at documentation for what to do. It tells you the operative statement, which is

<uri-of-file-containing-ontology>
xhv:license
<http://creativecommons.org/publicdomain/zero/1.0/>.

where xhv: abbreviates http://www.w3.org/1999/xhtml/vocab# .

Ideally you would assert this predicate and object for both the ontology (via its ontology URI) and the ontology version (if the version has its own URI), repeating for as many aliases as you know about. (Ontology versions are a particular feature of OWL 2, not of RDF.) You want to cover as many bases as you can. So you could end up with many statements like this.
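A minimal sketch of generating that family of statements, one per URI you want to cover. The subject URIs below are hypothetical placeholders (they are not from any real ontology), and the output is written as N-Triples so each statement stands on its own line:

```python
# Predicate and object taken from the statement above; xhv: expanded in full.
XHV_LICENSE = "http://www.w3.org/1999/xhtml/vocab#license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def cc0_triples(subject_uris):
    """Return one N-Triples line per subject URI asserting the CC0 waiver."""
    return [f"<{uri}> <{XHV_LICENSE}> <{CC0}> ." for uri in subject_uris]


# Hypothetical URIs: the ontology URI, a version URI, and a file alias.
subjects = [
    "http://example.org/onto",
    "http://example.org/onto/1.0",
    "http://example.org/onto.owl",
]

for line in cc0_triples(subjects):
    print(line)
```

The same list of lines can be pasted into the ontology file and into any other place where you want the waiver to be found.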

Similarly, you want to put these statements in as many places as you can, not just the ontology file itself but also any landing page that it might have (as shown in RDFa in the ccrel-guide).

Putting statements into an RDF serialization (e.g. RDF/XML) is straightforward, as shown, if you are editing the serialization directly. But if you are using an OWL tool such as Protege, it could be harder. Protege gives you two methods that might be used, ontology annotations and individual property assertions. You can use the ontology annotation pane to add xhv:license to the ontology, but not to the ontology version. To add individual property assertions for the ontology version you may have to put the three or more URIs in the ontology itself, which would just be tedious clutter, but I don’t see another choice.

Sadly all this work is speculative as there are no tools at present (of which I’m aware) that would pick up on the CC0 annotation. That’s not to say you shouldn’t do it, in fact I’m glad someone is willing to be a pioneer, as it will be a chicken-and-egg situation for quite a while.

In addition to expressing the waiver in RDF I would recommend writing a copyright statement in prose in an rdfs:comment ontology annotation property. The RDF statements themselves are likely to get lost or ignored, but with the rdfs:comment you have humans on your side. For wording you could use that given in the CC Rel guide or by the CC0 ‘chooser’ tool.

All of the above also applies if you’re attaching CC-BY or some other waiver or annotation, but ontologies are going to be easier to work with if they’re unencumbered, and the whole reason you wrote the ontology was so that it would be used, right?

Exercise for the adventurous reader: How does this approach fail if the httpRange-14 resolution’s advice isn’t observed?

Thanks to Ruth Duerr for asking.

Categories: Uncategorized

Tough URLs

Henry Thompson and I have been puzzling for a few years over the question of why the Web doesn’t have URIs that are widely perceived as both robust – in the sense of resisting attacks such as expiration, corruption, and censorship – and actionable – in the sense of just working in the browser. We have identifier systems that are one or the other, but not both – why?

The robust identifier systems that we have range from pre-Internet ones like Linnaeus’s binomial species names (which are tied to their priority literature reference), the chemical element symbols, ISSN, and so on, to modern inventions such as URNs, info: URIs, and the digital object identifier (DOI). Our actionable identifiers (or locators?) are things like http: and ftp: URIs – a notably disjoint set.

Why should anyone care about robust actionable URIs? The reason is that, if they existed, they would marry a cornerstone of civil discourse to the central modern communication technology, namely the Web.

We take robust reference for granted in everyday civil, legal, scientific, technical, and political discourse, so much so that it is not even called out as a named phenomenon. If you’re debating a law or a scientific article with someone, the last thing you want is for your argument to go wrong because the two sides are working from different documents – especially if the difference goes undetected. This would be stupid.

But reliable reference was not always the rule. It took the world hundreds of years following the invention of the printing press to deal with this problem. Now we are repeating the reference chaos of the early print world on modern technology.

References are easy to deal with if you are a human, speaking natural language, with a bit of time on your hands. If you see the species name Rana pipiens and know a little bit about how species names work you can look it up to get the primary reference for that name. Each identifier system has its own set of resolution services, many of them on the Web and open. But informal references in dozens of different identifier systems are not the same as first-class citizens on the Web – as I say, human intervention is required. Making references accessible to computers using ordinary (i.e. Web) protocols vastly accelerates any process that needs to follow them. And to do this, today, you need something that starts with http:// and a domain name.

By now you have no doubt found many ways to poke holes in what I’ve said so far. Are “tough” URIs really possible? What exactly could that mean? Isn’t it impossible to eliminate all vulnerabilities? On the other hand, given that DOIs are among my examples of robust identifiers, isn’t a URI such as http://dx.doi.org/10.1155/1987/47105 a counterexample to my claim that we don’t have robust actionable URIs? And if this is such a problem, why on earth hasn’t it been solved already? Is it inherently intractable or is this some kind of awful techno-social mistake that can be fixed?

What interests me is a sweet spot in between these two extremes: more robust than current-day doi.org URIs, but admitting the inevitability of certain vulnerabilities.

OK, I have more to say about threat analysis, IDF, ICANN, P2P, and so on, and will do so in a followup. In the meantime – if you want to talk about this, please come to our workshop in Bristol, UK, on December 8th!

Workshop announcement


Leveling the field for open access

Do researchers prefer to publish in closed access journals rather than open access in order to avoid OA publication charges? Librarians would certainly prefer they publish open access, since OA reduces their costs (in the long run) by reducing their subscription burden. I don’t know the answer, but if this is happening, universities might take steps to eliminate the incentive.

Here’s an idea: For each closed access article published, the university assesses a “subscription tax” comparable to what would have been assessed had the article been published open access, or maybe higher. That is, you can continue to support the subscription model, but you’ll have to pay for it, just as those publishing open access have to pay for open access.

The subscription tax goes to the libraries and is used to pay for subscriptions. The university overhead rate can be reduced, for everyone, by the amount raised through this new revenue stream.

Professor Smith’s decision between closed and open can now be made without financial bias. Currently she’ll pay (from her grant) $50,000 overhead plus $1,500 for open access or $0 for closed access. This would change to $48,500 overhead plus $1,500 for open access or $1,500 tax for closed access. Her grant officer is happy because the switch from overhead to article charge-or-tax is dollars-neutral, Smith is happy because she doesn’t need to factor open/closed into her venue decision, and the librarian is happy because more OA publishing is taking place and subscription load is dropping.

I know my arithmetic probably doesn’t come out right, but I hope you get the idea.
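Taking the figures above at face value, the comparison can be written out as a small worked example. The function and variable names are illustrative only, and the dollar amounts are the ones used for Professor Smith; this is a sketch of the idea, not a model of any real overhead policy:

```python
def grant_cost(overhead, article_charge_or_tax):
    """Total charged to the grant for one article."""
    return overhead + article_charge_or_tax


OA_CHARGE = 1_500         # open access publication charge
SUBSCRIPTION_TAX = 1_500  # proposed tax on a closed access article

# Current scheme: closed access costs nothing extra.
current = {
    "open": grant_cost(50_000, OA_CHARGE),
    "closed": grant_cost(50_000, 0),
}

# Proposed scheme: lower overhead, and both routes carry a charge.
proposed = {
    "open": grant_cost(48_500, OA_CHARGE),
    "closed": grant_cost(48_500, SUBSCRIPTION_TAX),
}

print("current bias toward closed:", current["open"] - current["closed"])
print("proposed bias toward closed:", proposed["open"] - proposed["closed"])
```

With these numbers the open/closed gap goes from $1,500 to zero, which is the whole point of the proposal.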

I’m sure someone has already thought about this, but this is a blog so I get to write things like the above without bothering to do background research.

(Inspired by Francis Pinter’s video and the recent Guardian piece.)


Authorization and meaning

Larry,

I’m reviewing the issue-57 discussion at the recent TAG F2F and I notice that you hammer repeatedly against the connection between meaning and service:

  • I don’t think it should cost energy to mean something
  • I’m concerned [about] the way you’re describing the pattern – [dependency on running web server] … if you do a GET on the string, web server must be running when you do that GET.

and so on.

These IRC comments were ignored at the time. Now I will try to answer them.

If I ever talk about meaning being determined by GET, it is only a shorthand for a more nuanced story about authorization; I don’t really mean it. I’m sorry if that’s confusing.

My understanding of this from HTTPbis is that a “representation” (or other response), over whatever protocol (including any inter-brain protocol), is authorized for an http: URI by the domain name owner. That is, it is not correct for a cache or proxy to deliver a representation for a dereference of that URI that is not so authorized. The HTTP protocol is one way to express such an authorization, and because of Expires: the authorization can last for up to a year. But it does not take any energy for a representation to *be* authorized for a URI. The domain owner’s server can shut down completely, but copies cached in disks or on the sides of buses or inside brains can continue to be authorized.

You might even be able to find out whether a representation is authorized by, say, calling the current domain owner on the phone (not sure whether HTTPbis allows this). And other URI schemes have their own ways for a representation to be authorized. For example, RFC 2397 and your duri: draft authorize representations for a whole bunch of URIs for a very long time (forever).

But regardless of the URI scheme, representation-related meaning is either cached (maybe in your brain), looked up, or calculated by an authorized formula (as in the case of data:), and while perhaps no energy is needed for meaning itself, there is no caching or lookup or calculation apparatus that does not require energy to maintain. (Of course you know this distinction, sorry if I seem to lecture.)

If there were a legitimate way to authorize an http: representation for a very long period of time, such as true domain name ownership, then maybe we wouldn’t have to worry so much about meanings in http: space changing every year. RFC 2397 seems to do a pretty good job of authorizing representations forever; but the future is inherently unpredictable, so even the meanings of data: and duri: URIs are uncertain. Perhaps their meanings will be redefined by HTML6, in order to accommodate the way deployed infrastructure understands them. … unless by “meaning” you mean duri:2011:word:meaning… oh wait…

Now I’m mostly with you on http: not necessarily being the best basis for civil discourse (which requires citations, quite separately from up-to-date links), and duri: being superior in many ways. But issue-57 is not the best place, in my opinion, to address persistence concerns, given choices that have already been made by the affected community. I think it belongs with issue-50.

(Footnote: by “authority” throughout I mean the nice kind, the kind that’s granted, not imposed.)

(Footnote: someday I’ll figure out a ZBAC way to analyze all this. Not there yet.)


Crossref’s gift of metadata

2011-05-04

I was delighted to learn of Crossref’s April 20 announcement (press release; Geoff Bilder’s blog post) that they are making their DOI metadata available in RDF via HTTP. This is a significant development for scholarship on the Web and an important step toward a fully open and reliable scholarly edifice.

For those of you not familiar with this database, it has about 46 million records (and growing), keyed by strings called “digital object identifiers”. DOIs are similar to the ISBNs used for books, but are applied at a finer level of granularity – mainly for academic research articles published in the past 10 years, but with coverage steadily growing. Each record has basic bibliographic metadata for its “object” such as author, title, publisher, publication date. For an example try

curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1155/1974/82714"
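For readers who would rather do this from a program, here is a rough Python equivalent using only the standard library. The helper name `turtle_request` is my own invention for illustration; the URL and the Accept header are the ones from the curl command. The network call is left commented out so the sketch can be read and run without touching the server:

```python
import urllib.request


def turtle_request(doi):
    """Build a GET request for a DOI that asks for Turtle via content negotiation."""
    url = "http://dx.doi.org/" + doi
    return urllib.request.Request(url, headers={"Accept": "text/turtle"})


req = turtle_request("10.1155/1974/82714")
print(req.full_url)

# Uncomment to actually fetch the metadata. urlopen follows redirects,
# which plays the role of curl's -L flag.
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers.get("Content-Type"))
#     print(resp.read().decode("utf-8"))
```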

(This Google 1-gram, although it only reflects occurrences of “DOI” in books, hints at the growing popularity of DOIs.)

The value of the database derives in large part from the strength of Crossref’s publisher rules, which help guarantee DOI uniqueness and metadata quality.

Open access to metadata is not as wonderful as open access to the content of the articles, but it’s an important toehold. For example, DOI metadata may be what enables an automated assistant to find a copy of an article in a library collection you have access to, or to find data sets or database accessions that come from it or refer to it.

Crossref’s announcement is much more important than your run of the mill open data announcement, for a variety of reasons. First, the data is central, since the literature is a hub for other kinds of information. This database describes the scholarly literature, the backbone of research. Nearly anything you want to say or record, as a scholar, either derives from the literature or uses it as evidence. DOI metadata helps make all kinds of statements more concise, rigorous, and machine-friendly.

Second, the data can be used by a wide variety of tools. Reference managers such as Mendeley and Zotero already access DOI metadata – I’m not sure how, possibly using the older password-protected OpenURL interface in some sneaky way – in that you can give them a DOI and they will automatically fill in author, title, and so on in a reference list. But now all sorts of other tools will be able to do the same sort of thing. I imagine Crossref’s service becoming standard in all sorts of annotation and social networking tools, database front ends, and so on. Rather than scraping this information from web pages, a tool can just find or accept a DOI, and obtain the metadata from Crossref.

Third, it suggests that we may be getting closer to bulk download and open mirrors for Crossref’s data. Such mirrors will be necessary not only for use in citation network research and integration into other databases, but also in order to protect the DOI system from attack. Given the international nature and inherent skepticism of the scholarly community, it is important that access to the metadata not be vulnerable to the administrative, technical, or legal failure of Crossref or its supporting infrastructure. Lots of copies of the database would mean protection against such failures. [removed sentence 5/5]

Fourth, this information is interesting enough that developers who have previously stayed away from RDF and LOD will now link in an RDF parser as a means to an end, not an end in itself. This ought to be a boost to the LOD world, which in my mind is dominated by solutions in search of a problem.

Fifth, it is very cool that Crossref is observing the “httpRange-14 resolution”, which in effect says that metadata applies to normal Web pages, by using HTTP 303 responses to flag a not-so-normal situation. The 303 ensures that the URI form of a DOI refers to the article itself, even if it’s behind a paywall, not to the landing page that you might arrive at when dereferencing the URI. Crossref could easily have taken the low road and kept the 302 redirects they were using before, but that would have led to confusion over whether the metadata applied to the landing page or to the article, and they had the wisdom to foresee this. This is a subtle point of Web architecture and I’m glad they got it.
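As a rough illustration of what the resolution lets a client conclude from the status code of a GET, here is a small sketch. The function name and the returned phrases are mine; the three cases paraphrase the resolution as I understand it, and anything else is simply treated as not covered:

```python
def what_the_uri_may_identify(status):
    """Paraphrase of the httpRange-14 cases for the response to a GET."""
    if 200 <= status < 300:
        # A normal Web page: metadata about "the resource" is about the page.
        return "an information resource"
    if status == 303:
        # The not-so-normal situation: the URI may name anything,
        # e.g. an article rather than its landing page.
        return "any resource"
    if 400 <= status < 500:
        return "unknown"
    return "not covered by the resolution"


for code in (200, 302, 303, 404):
    print(code, "->", what_the_uri_may_identify(code))
```

Under this reading, the old 302 redirects said nothing either way, which is where the confusion between landing page and article would have crept in.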

It would be nice if the metadata pages asserted their own legal status, preferably using a CC0 waiver. This is probably not necessary since the information is factual and (IANAL) not protected by copyright law, but clarity is always welcome. This issue is endemic to all open data, so I will take it up another time and not single out Crossref.


Are you confused yet about the word “representation”?

WordNet entry for “representation”.

S: (n) representation, mental representation, internal representation (a presentation to the mind in the form of an idea or image)
S: (n) representation (a creation that is a visual or tangible rendering of someone or something)
S: (n) representation (the act of representing; standing in for someone or some group and speaking with authority in their behalf)
S: (n) representation, delegacy, agency (the state of serving as an official and authorized delegate or agent)
S: (n) representation (a body of legislators that serve in behalf of some constituency) “a Congressional vacancy occurred in the representation from California”
S: (n) representation (a factual statement made by one party in order to induce another party to enter into a contract) “the sales contract contains several representations by the vendor”
S: (n) theatrical performance, theatrical, representation, histrionics (a performance of a play)
S: (n) representation (a statement of facts and reasons made in appealing or protesting) “certain representations were made concerning police brutality”
S: (n) representation (the right of being represented by delegates who have a voice in some legislative body)
S: (n) representation (an activity that stands as an equivalent of something or results in an equivalent)


Niklaus Wirth.
Program Development by Stepwise Refinement.
Communications of the ACM, Vol. 14, No. 4, April 1971, pp. 221-227.

In order to refine these instructions and predicates further in the direction of instructions and predicates available in common programming languages, it becomes necessary to express them in terms of data representable in those languages. A decision on how to represent the relevant facts in terms of data can therefore no longer be postponed.


Tim Berners-Lee.
Generic Resources.
Web page, 1996-2009.

A resource may be generic in that as a concept it is well specified but not so specifically specified that it can only be represented by a single bit stream.


T. Berners-Lee, R. Fielding, and H. Frystyk.
Hypertext Transfer Protocol — HTTP/1.0.
RFC 1945, IETF, May 1996.

A feature of HTTP is the typing of data representation, allowing systems to be built independently of the data being transferred.


R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee.
Hypertext Transfer Protocol — HTTP/1.1.
RFC 2616, IETF, June 1999.

A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.

Resources may be available in multiple representations (e.g. multiple languages, data formats, size, and resolutions).

representation [definition from glossary]:
An entity included with a response that is subject to content negotiation.


Tim Berners-Lee.
“The range of the HTTP dereference function.”
Email to www-tag list, March 2002.

HTTP is a protocol which provides, for the client, a mapping (the http URI dereference function) from URI starting with “http:” and not containing a “#” to a representation of a document. The document is the abstract thing and the representation is bits.


Roy T. Fielding and Richard N. Taylor.
Principled Design of the Modern Web Architecture.
ACM Transactions on Internet Technology (TOIT), 2002.
[JAR comments inserted]

… allowing a user to progress through the application by selecting a link or submitting a short data-entry form, with each action resulting in a transition to the next state of the application by transferring a representation of that state to the user.

[A representation is of a state.]

REST components communicate by transferring a representation of the data …

[A representation is of data.]

Finally, it allows an author to reference the concept rather than some singular representation of that concept, thus removing the need to change all existing links whenever the representation changes.

[A representation is of a concept.]

Depending on the message control data, a given representation may indicate the current state of the requested resource, the desired state for the requested resource, or the value of some other resource, such as a representation of the input data within a client’s query form, or a representation of some error condition for a response.

[A representation indicates a state.]

… the specification of Web addresses also defines the scope and semantics of what we mean by resource, which has changed since the early Web architecture. REST was used to define the term resource for the URI standard [Berners-Lee et al. 1998], as well as the overall semantics of the generic interface for manipulating resources via their representations.

[A resource can be manipulated via a machine interface.]

A resource does not always map to a singular file, but all resources that are not static are derived from some other resources, and by following the derivation tree an author can eventually find all of the source resources that must be edited in order to modify the representation of a resource. [emphasis JAR's]

[A resource is derived from editable sources.]

Semantics are a byproduct of the act of assigning resource identifiers and populating those resources with representations. At no time whatsoever do the server or client software need to know or understand the meaning of a URI — they merely act as a conduit through which the creator of a resource (a human naming authority) can associate representations with the semantics identified by the URI. In other words, there are no resources on the server; just mechanisms that supply answers across an abstract interface defined by resources. [emphasis JAR's]

[Resources do not reside on servers.]


Ian Jacobs and Norman Walsh, editors.
Architecture of the World Wide Web, Volume One.
W3C Recommendation, December 2004.

[Document approved for advancement to Proposed Recommendation by Roy Fielding and the rest of the TAG.]

[Glossary] A representation is data that encodes information about resource state. Representations do not necessarily describe the resource, or portray a likeness of the resource, or represent the resource in other senses of the word “represent”.

[1] In this travel scenario, the resource is a periodically updated report on the weather in Oaxaca…

[2.2] In the case of this document, the message payload is the representation of this document.


Tim Berners-Lee, Roy Fielding, and Larry Masinter.
Uniform Resource Identifier (URI): Generic Syntax.
RFC 3986, IETF, January 2005.

When URIs are used within information retrieval systems to identify sources of information, the most common form of URI dereference is “retrieval”: making use of a URI in order to retrieve a representation of its associated resource. A “representation” is a sequence of octets, along with representation metadata describing those octets, that constitutes a record of the state of the resource at the time when the representation is generated.


Tim Berners-Lee.
The meaning of “representation”.
Email to www-tag list, November 2007.

In fact, the relationship includes social as well as technical aspects. It also is defined, often, by high-level protocols. These higher level protocols set common expectations between the publisher and the reader. These are not consistent across the web. That is why you can’t simplistically just give a formula for that relationship.
…There is information in it in RDF which says that I made it and it is my public profile document.
…These expectations are very important to the web working. That’s why you can’t just write down a formula for the constraint on the representations of a given resource.


Roy Fielding.
“Some notes on organizing discussion on WebApps architecture.”
Email to www-tag list, October 2010.

On Oct 14, 2010, at 2:11 PM, Larry Masinter wrote:

Well, I wonder if we might introduce another step between “resource” and “representation” which is “application resource in identified state”, so that the representation isn’t a representation of the resource, but a representation of the resource in that state.

Umm, what? That would be terribly confusing and contrary to why I used the term representation in the first place (it is a representation by the origin server to the recipient of the state of that identified resource at the time of message generation).


The place of metadata

I am entering this picture into the blog mainly so I can refer to it if I need to.

Relationships between thing, IR, description, metadata


The problem is that people often get confused about these relationships. Maybe I am confused and it’s catching, but I have no evidence of this.

Some people like to talk about “the metadata for a bridge”.  That really bothers me. Metadata in common usage (outside of circles of confused informatics weenies) means information about information. Look it up. So what’s really meant is “a description of a bridge” or “data about a bridge”.

Some people like to say that all information resources (document-like things) are descriptions. They aren’t. A symphony isn’t a description, nor is a word list.

The main message is the center axis but the side classes are added to document what another subclass choice would be in each case.

I don’t know much UML but I have picked up one habit, which is to use this kind of arrow (the one whose head looks like a triangle) to mean “is a subclass of” or “is a”. If it were a relation like others you see in diagrams like this where the nodes are classes, the relation would be “equals” where some of the things in the target are not equal to anything in the source. Using a different kind of arrowhead emphasizes the difference in kind. Relation links to me mean “some of these are [label]-related to some of those.” I could use UML or ER detailing (double-arrow and so on) to say whether it’s some or all in each case and whether the relationship is functional or inverse functional, but I don’t because I can never remember what those funny annotations mean.

To relate to the FRBR discussion, a BR (bibliographic record) would be one kind of metadata record.


A blowfly reflects

Data is (are) the prey and sustenance of science.  With vigilance, cunning, and skill, the lionesses detect and subdue it.  The alpha males get first access to the kill (and, strangely, credit for it), followed by the females and cubs. When they cannot be kept at bay any longer, the hyenas get to continue the lions’ work.  At that point, if we’re lucky, the remaining carcass is left to open access, and we vultures and blowflies can do with it what we will.


FRBR and the Web

I’m going to assume some familiarity with FRBR, so if you want to read the below and don’t know FRBR, you might want to consult the FRBR specification.

The idea of FRBR is that when organizing bibliographic information into database records it’s useful to group the kinds of things you would typically say into four groups.  If you have a physical book, you might talk about (1) attributes specific to that physical copy, such as where it is and what condition it’s in; (2) attributes shared by other copies in its print run or edition; (3) attributes shared with other editions such as title, date, and author; and (4) attributes shared with translations and adaptations, such as intended audience or form (e.g. novel vs. poem).  If you organize your database this way you will have four different bibliographic records applicable to the book.  Overall the system’s records will be arranged in a forest, so that for each more-specific record there is exactly one less-specific record.

Now the description of FRBR says that each record “is about” or “describes” some “entity”, an ontologically dubious proposition.  I would say FRBR entities are theoretical inventions, abstractions, or fictions, depending on my mood, and from an ontological point of view it’s not clear to me that saying the records are about things other than physical items is the most helpful way to think about the enterprise. But it’s not particularly harmful either.

For someone familiar with description logic (DL) a different framework presents itself based on these “functional requirements”.  You could say that your domain of discourse consists of FRBR Items, physical carriers of information.  If you state a set of properties (such as a number of properties at a single FRBR level) you have defined a class of Items having those properties, so any of the three other kinds of record is in effect a class definition.  An “embodies” or “realizes” relation between FRBR “entities” is really a subclass relation between classes of Items.
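A toy sketch of that reading, using Python sets in place of DL classes. Everything here is hypothetical: the `Item` fields, the helper `items_with`, and the catalogue data are invented for illustration, with one field standing in for each FRBR level:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    form: str      # stands in for a Work-level property
    title: str     # stands in for an Expression-level property
    edition: str   # stands in for a Manifestation-level property
    location: str  # Item-level property, specific to the physical copy


def items_with(items, **props):
    """The class of Items having all the given property values."""
    return {it for it in items
            if all(getattr(it, name) == value for name, value in props.items())}


# Hypothetical catalogue of physical copies.
ITEMS = {
    Item("novel", "Example Novel", "first edition", "shelf A"),
    Item("novel", "Example Novel", "first edition", "shelf B"),
    Item("novel", "Example Novel", "paperback reissue", "shelf C"),
    Item("novel", "Roman Exemple (translation)", "first edition", "shelf D"),
}

work = items_with(ITEMS, form="novel")
expression = items_with(ITEMS, form="novel", title="Example Novel")
manifestation = items_with(ITEMS, form="novel", title="Example Novel",
                           edition="first edition")

# "Embodies" and "realizes" show up as subclass (here: subset) relations.
print(manifestation <= expression <= work)
```

Each record is just a property filter, and the more properties a record states, the smaller the class of Items it picks out.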

However the two formulations are interdefinable.  Forgive me if I switch back and forth.

This brings me to the Web.  If I write an article in my word processor, and save it to disk, there are two Items, the copy in main memory and the copy on the disk.  Put it on a Web server and many more Items (copies) can be made, but (in FRBR terms) they all “exemplify” a single Manifestation.  In DL terms you’d say they share Manifestation-level properties, and all things that have those properties form the class corresponding to the Manifestation.  In the HTTP protocol many copies of the source Item are made as the information passes through network buffers, proxies, caches, and application libraries, but they all still exemplify the same Manifestation.  So it seems safe to say that any single GET/200 exchange is associated with a single Manifestation, at least when the response has any resemblance to something you’d be concerned with in a “bibliographic record”.

Now if you do multiple HTTP GETs of the same URI, you may always see the same Manifestation, in which case you might say there’s a special relationship between that URI and the Manifestation.  If on the other hand, there is variation between the Manifestations you get but they all embody a single Expression (for example, you get the same words encoded differently – say text/plain and text/html), then there would seem to be a special relationship between that URI and the Expression.  Similarly for Work – content negotiation could get you Expressions in different languages.

Whether there is response variation, and how widely it extends, is impossible to determine using HTTP GET alone (even with the help of Vary: headers), so the truth value of these ‘cozy’ relationships is unknowable for Popperesque reasons.  I may issue a thousand requests and always get the same Manifestation, but get a different one on request number 1001.  So if I say that a URI is cozy with a Manifestation, I had better either have inside knowledge of how that web server is configured (e.g. I could be the one running it), or else I need to be prepared to be proven wrong.
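The Popperesque point can be made concrete with a small sketch. It assumes, purely for illustration, that two responses are the same Manifestation exactly when their bodies are byte-for-byte identical, which is cruder than anything FRBR says; the function names are mine:

```python
import hashlib


def observed_manifestations(bodies):
    """Distinct response bodies seen so far, identified by SHA-256 digest."""
    return {hashlib.sha256(body).hexdigest() for body in bodies}


def cozy_so_far(bodies):
    """True if every response observed so far was the same Manifestation.

    This can refute coziness but never prove it: the next GET may differ.
    """
    return len(observed_manifestations(bodies)) == 1


# A thousand identical responses, then a different one on request 1001.
responses = [b"<html>same page</html>"] * 1000
print(cozy_so_far(responses))

responses.append(b"<html>something else</html>")
print(cozy_so_far(responses))
```

A result of True only ever means "not yet refuted", which is why inside knowledge of the server is the only way to actually assert the relationship.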

For some URIs there is no such cozy relationship of the URI to any FRBR entity.  There is always the aggregate collection of all the received Manifestations (or Expressions or Works), but the relationship between the Manifestations you get and the aggregate is different, part-of rather than embodies (etc.).  For example, different blog posts can come from the same URI at different times, but they are parts of the blog, not embodiments (realizations, etc.) of it.  FRBR would describe the blog posts and the blog very differently.

The Manifestation “cloud” around some URIs seems to fall outside of FRBR entirely.  Consider a page inside a bank’s web site for current account balances.  From a single URI, different users get different pages depending on their session authentication.  This is not really a serial publication, as pages might be issued simultaneously, and it is not really a collection since the pages aren’t collected together in one place.  So the URI does not seem to be ‘cozy’ with any FRBR entity.  Hoping that FRBR will be helpful in understanding all Web URIs is probably too much to ask.

We can connect some of this to web architecture rhetoric. The webarch theory holds that there is something that the URI “identifies” and that it is an “information resource” that has “representations”. It seems consistent to say that a “representation” is close to what FRBR calls a Manifestation, and that FRBR Manifestations, Expressions, and Works can all be “information resources” since any of the three can cozy up to a URI. (Compare TimBL’s “Generic Resources” note.) Undoubtedly there are other “information resources” as well, but they may either correspond to FRBR aggregates (swapping “exemplifies” and so on for “part of” in the way “representation of” works), or they may be undescribable using FRBR.

[2/13: Alan R points out that according to FRBR not all Expressions have Manifestations.]

In the “semantic web” world, where a URI is used not just with HTTP but as a name that refers to something, one could use a Web URI to refer to the Manifestation you get, its Expression, or the Expression’s Work, depending on the range of Manifestations that you get from GET/200 (and given the limitations described above). This is where the DL Item-class approach wins, because it lets you ascribe properties to “information resources” without having to commit to any particular level of the FRBR hierarchy, and thus without having to be conscious of the FRBR record modularity. You can write down an Expression-level property, and then if you later find it’s better to use the URI to refer to a Manifestation, that’s not a problem. Of course the other direction doesn’t work so well. (I’m pulling a fast one here – I need to write another blog post on generic individuals.)

For a more sophisticated picture of how the Web relates to documents, see Henry Thompson’s article on URIs.

I’m surely not the first to think about all this, but I got tired of research after two Google searches.

Thanks to Allen Renear for prompting these thoughts and being interested, Tim Danford for “cryptic class”, and Alan Ruttenberg for null hypotheses.

Categories: Uncategorized

To what does a URI refer?

(Trying out a new theory here, apologies if I change my mind next week.)

Suppose someone sends you email (or a text message) that says:

“I really enjoyed reading http://en.wikipedia.org/wiki/Communist_manifesto – you should take a look!”

What piece of writing do you think they’re talking about? One written in the 21st century, or one written in the 19th century?

Because URIs were invented for the web, general practice has been to use a URI to refer to the thing that is on the web at that URI, if there is such a thing. That would be the same thing we observe when we “browse” to that web page. This holds categorically, regardless of what that web page says.

The categorical nature of this convention is important because computers are stupid. They are our assistants in interpreting the web, and can’t be expected to pick up on indirect cues that are supposed to inform the interpretation of a URI.

If we don’t observe any content at that URI – that is, if we get a non-success response (success being a “2xx” status in the HTTP protocol) – then all bets are off and some other way of using the URI might apply.
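The convention so far is simple enough for a stupid computer to apply. A minimal Python sketch, with hypothetical function names and assuming the status code has already been obtained from a GET on the URI:

```python
def is_success(status: int) -> bool:
    """True for HTTP status codes in the 2xx ("success") class."""
    return 200 <= status <= 299


def default_referent(uri: str, status: int) -> str:
    """Apply the categorical convention: on a 2xx response the URI refers
    to the web page found there, whatever that page says about itself.
    On anything else the convention is silent."""
    if is_success(status):
        return f"the web page at {uri}"
    return "undetermined"
```

Note that the function never looks at the content of the response, which is the sense in which the rule holds categorically.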

From time to time it is proposed to change the way we use these URIs (the ones that are associated, on the web, with web pages) to refer to things other than the web pages with which they’re associated. Attempting to disrupt an established practice is probably a bad idea. If there’s some problem for which redefining the way we use URIs seems to be the solution, we should look very carefully at other options before launching a campaign to convince everyone who uses URIs to do something new.

You can stop reading here if what you’ve read so far makes sense.

Objections

Web pages aren’t real or are ill-conceived? - It’s hard to point at anything in particular and identify it as being a web page. A file in a computer? Sometimes. The output of a content management system under certain circumstances? Sometimes. The stuff we might get when we GET? Maybe. A book or article or song? One would think. Based on the way we talk about them, web pages seem best described as regularities that might be observed at a fixed “place” (such as “at” a URI on the web). One page might be regularly observed to have the same author and declarative content, while another (a group blog or news site, for example) might have frequently changing author and content. In either case the page is characterized by its lawful variation or lack thereof. Web pages may be one or two degrees abstracted from reality, but they are as real as many things we usefully talk about.

We can use one URI to refer to different things depending on context? – You can try, but there are some problems with trying to revoke the “U” in “URI”. One is that computers and programmers get things wrong, and the likelihood that the context might be misread makes for a fragile system. Another is that context sensitivity is a threat to polymorphism (functions that are generic across domains) and to interoperability (combining functions across domains).

And don’t get me started on what the specs say.

We can say a URI refers to an ontological chimera of a web page and something else? – That is, http://dbpedia.org/page/Paris might simultaneously have both authors and a population? The authors of the page, that is, and the population of the city. That sounds confusing, what kind of thing would you say it is, a pagity? Worse, what do you do when the “something else” is itself similar to a web page, as in the example at the top, and there are statements (author, subject matter, date, etc.) that are true of the snake and false of the goat?

Regularities are unknowable? – If a web page is defined by its regularities, and regularities are universally quantified statements (e.g. the author of this page is always Jonathan Rees), and universal propositions are unknowable (cf. Popper), then isn’t the “web page” idea useless? Well, regularities of physically realized web pages are in the realm of hypotheses about the future that can’t be proven, only falsified. The answer to that is that we make statements like this about reality all the time, and do useful inferences with them, subject to whatever epistemological framework we choose to work with. We try not to be wrong, but if we are, we fix our mistakes.

How do regularities come to be known (or hypothesized)? - Maybe someone tells you (like the individual who deployed the page, or a page that links to the page), and you find them credible. Or, you make observations (look at the page a few times) and form an idea. Or maybe it resembles some other web page you know a lot about, and you’re willing to make assumptions based on that. Or maybe you control the page yourself and can cause it to have whatever properties you like.

We don’t really use URIs to refer, only to locate? – That is, the example at the top might be expected to lead to confusion, and to be clear we really ought to say “the article at (URI)” or “the manifesto described at (URI)”, never (URI) without “at”. Maybe so, but retracting the use of URIs for reference would be a major redesign of the way we talk about the web (at least in IETF and W3C), and would obsolete most if not all RDF content.

But we often use URIs to refer to things that aren’t web pages? – Yes, but URIs that don’t refer to web pages aren’t made to look like them. Thus the practice of using HTTP 303, or “hash URIs” without a corresponding document fragment, as ways to avoid such an appearance. These techniques are kludges but historically they made the ‘semantic web’ possible since they didn’t rely on deployment of a new protocol or a revolution in the way URIs are used.
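To make the two escape hatches concrete, here is a hedged Python sketch of how a client might sort URIs. The function name and the returned labels are made up for illustration, the status code is assumed to come from a GET on the URI with its fragment stripped, and the sketch does not check whether the retrieved document actually contains a matching fragment.

```python
from urllib.parse import urldefrag


def classify(uri: str, status: int) -> str:
    """Sort a URI by how it presents itself on the web.

    "hash URI"     - has a fragment, so it need not name the page itself
    "303 redirect" - the server declined to serve a page at this URI
    "web page"     - 2xx response, the default convention applies
    "undetermined" - any other response, all bets are off
    """
    _, fragment = urldefrag(uri)
    if fragment:
        return "hash URI"
    if status == 303:
        return "303 redirect"
    if 200 <= status <= 299:
        return "web page"
    return "undetermined"
```

For example, classify("http://example.org/id/paris", 303) gives "303 redirect", while the same call with status 200 gives "web page".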

I can’t teach my server to do 303 or fragment ids? - As these are the only ways at present to associate URIs with things other than web pages in a way that’s easily discoverable, you’ll need to either contract with someone else to deploy your URIs, forgo discoverability (i.e. use RDF with unresolvable URIs – this is unpleasant but not technically wrong), or team up with others to convince the world to implement a new discovery protocol that you can implement (such as /.well-known/host-meta). The standards organizations may be the most effective way to do this.

However, I find it highly unlikely that neither discovery technique will work for you, as they’re not that tricky. Seek advice.

What “web page”, I thought they were “information resources”? - You get a more general theory if there is a class of things that includes both web pages and potential web pages: things that might not be on the web yet or ever, but could be. We might call these things “information resources”. The term “information resource” has a confusing history, and could easily be seen as incompatible with what I’m talking about, but if they sound the same to you that’s fine.
