How to apply a CC0 waiver to an ontology
OK, there are two issues, one being what statements (triples) are needed in order to assert the waiver, the other being where to put them.
If there is a “landing page” for the ontology then CC Rel by Example gives a good start at documentation for what to do. It tells you the operative statement, which is
<uri-of-file-containing-ontology>
xhv:license
<http://creativecommons.org/publicdomain/zero/1.0/>.
where xhv: abbreviates http://www.w3.org/1999/xhtml/vocab# .
Ideally you would assert this predicate and object for both the ontology (via its ontology URI) and the ontology version (if the version has its own URI), repeating for as many aliases as you know about. (Ontology versions are a particular feature of OWL 2, not of RDF.) You want to cover as many bases as you can. So you could end up with many statements like this.
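To see how quickly the statements multiply, here is a minimal sketch that writes one N-Triples line per subject URI. The example.org URIs and the function name cc0_statements are made up for illustration; substitute the ontology URI, version URI, and aliases you actually have.

```python
# Predicate and object taken from the operative statement above.
XHV_LICENSE = "http://www.w3.org/1999/xhtml/vocab#license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def cc0_statements(subject_uris):
    """Return one N-Triples line per URI asserting the CC0 waiver."""
    return [f"<{uri}> <{XHV_LICENSE}> <{CC0}> ." for uri in subject_uris]


# Hypothetical URIs: ontology, ontology version, and the file itself.
uris = [
    "http://example.org/ontology",
    "http://example.org/ontology/1.0",
    "http://example.org/ontology.owl",
]
statements = cc0_statements(uris)
print("\n".join(statements))
```

The output can be pasted into any N-Triples or Turtle file; other serializations such as RDF/XML need the same three-part statements in their own syntax.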
Similarly, you want to put these statements in as many places as you can, not just the ontology file itself but also any landing page that it might have (as shown in RDFa in the ccrel-guide).
Putting statements into an RDF serialization (e.g. RDF/XML) is straightforward, as shown, if you are editing the serialization directly. But if you are using an OWL tool such as Protege, it could be harder. Protege gives you two methods that might be used, ontology annotations and individual property assertions. You can use the ontology annotation pane to add an xhv:license to the ontology, but not the ontology version. To add individual property assertions for the ontology version you may have to put the three or more URIs in the ontology itself, which would just be tedious clutter, but I don’t see another choice.
Sadly all this work is speculative as there are no tools at present (of which I’m aware) that would pick up on the CC0 annotation. That’s not to say you shouldn’t do it, in fact I’m glad someone is willing to be a pioneer, as it will be a chicken-and-egg situation for quite a while.
In addition to expressing the waiver in RDF I would recommend writing a copyright statement in prose in an rdfs:comment ontology annotation property. The RDF statements themselves are likely to get lost or ignored, but with the rdfs:comment you have humans on your side. For wording you could use that given in the CC Rel guide or by the CC0 ‘chooser’ tool.
All of the above also applies if you’re attaching CC-BY or some other waiver or annotation, but ontologies are going to be easier to work with if they’re unencumbered, and the whole reason you wrote the ontology was so that it would be used, right?
Exercise for the adventurous reader: How does this approach fail if the httpRange-14 resolution’s advice isn’t observed?
Thanks to Ruth Duerr for asking.
Tough URLs
Henry Thompson and I have been puzzling for a few years over the question of why the Web doesn’t have URIs that are widely perceived as both robust – in the sense of resisting attacks such as expiration, corruption, and censorship – and actionable – in the sense of just working in the browser. We have identifier systems that are one or the other, but not both – why?
The robust identifier systems that we have range from pre-Internet ones like Linnaeus’s binomial species names (which are tied to their priority literature reference), the chemical element symbols, ISSN, and so on, to modern inventions such as URNs, info: URIs, and the digital object identifier (DOI). Our actionable identifiers (or locators?) are things like http: and ftp: URIs – a notably disjoint set.
Why should anyone care about robust actionable URIs? The reason is that, if they existed, they would marry a cornerstone of civil discourse to the central modern communication technology, namely the Web.
We take robust reference for granted in everyday civil, legal, scientific, technical, and political discourse, so much so that it is not even called out as a named phenomenon. If you’re debating a law or a scientific article with someone, the last thing you want is for your argument to go wrong because the two sides are working from different documents – especially if the difference goes undetected. This would be stupid.
But reliable reference was not always the rule. It took the world hundreds of years following the invention of the printing press to deal with this problem. Now we are repeating the reference chaos of the early print world on modern technology.
References are easy to deal with if you are a human, speaking natural language, with a bit of time on your hands. If you see the species name Rana pipiens and know a little bit about how species names work you can look it up to get the primary reference for that name. Each identifier system has its own set of resolution services, many of them on the Web and open. But informal references in dozens of different identifier systems are not the same as first-class citizens on the Web – as I say, human intervention is required. Making references accessible to computers using ordinary (i.e. Web) protocols vastly accelerates any process that needs to follow them. And to do this, today, you need something that starts with http:// and a domain name.
By now you have no doubt found many ways to poke holes in what I’ve said so far. Are “tough” URIs really possible? What exactly could that mean? Isn’t it impossible to eliminate all vulnerabilities? On the other hand, given that DOIs are among my examples of robust identifiers, isn’t a URI such as
http://dx.doi.org/10.1155/1987/47105
a counterexample to my claim that we don’t have robust actionable URIs? And if this is such a problem, why on earth hasn’t it been solved already? Is it inherently intractable or is this some kind of awful techno-social mistake that can be fixed?
What interests me is a sweet spot in between these two extremes: more robust than current-day doi.org URIs, but admitting the inevitability of certain vulnerabilities.
OK, I have more to say about threat analysis, IDF, ICANN, P2P, and so on, and will do so in a followup. In the meantime – if you want to talk about this, please come to our workshop in Bristol, UK, on December 8th!
[Minor copy edits on 2012-08-02]
Leveling the field for open access
Do researchers prefer to publish in closed access journals rather than open access in order to avoid OA publication charges? Librarians would certainly prefer they publish open access, since OA reduces their costs (in the long run) by reducing their subscription burden. I don’t know the answer, but if this is happening, universities might take steps to eliminate the incentive.
Here’s an idea: For each closed access article published, the university assesses a “subscription tax” comparable to what would have been assessed had the article been published open access, or maybe higher. That is, you can continue to support the subscription model, but you’ll have to pay for it, just as those publishing open access have to pay for open access.
The subscription tax goes to the libraries and is used to pay for subscriptions. The university overhead rate can be reduced, for everyone, by the amount raised through this new revenue stream.
Professor Smith’s decision between closed and open can now be made without financial bias. Currently she’ll pay (from her grant) $50,000 overhead plus $1,500 for open access or $0 for closed access. This would change to $48,500 overhead plus $1,500 for open access or $1,500 tax for closed access. Her grant officer is happy because the switch from overhead to article charge-or-tax is dollars-neutral, Smith is happy because she doesn’t need to factor open/closed into her venue decision, and the librarian is happy because more OA publishing is taking place and subscription load is dropping.
I know my arithmetic probably doesn’t come out right, but I hope you get the idea.
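For what it is worth, the figures in Professor Smith's example can be checked with a few lines of Python. This is only a sketch using the dollar amounts given above; total_cost is a hypothetical helper, and nothing here models how a real overhead rate is computed.

```python
def total_cost(overhead, article_charge):
    """Total paid from the grant: overhead plus the OA charge or the tax."""
    return overhead + article_charge


# Today: $50,000 overhead, $1,500 for open access, nothing for closed access.
current = {"open": total_cost(50_000, 1_500), "closed": total_cost(50_000, 0)}

# Proposed: $48,500 overhead, $1,500 either way (OA charge or subscription tax).
proposed = {"open": total_cost(48_500, 1_500), "closed": total_cost(48_500, 1_500)}

# The gap between open and closed is what biases the venue decision.
current_gap = current["open"] - current["closed"]
proposed_gap = proposed["open"] - proposed["closed"]
print(current, current_gap)
print(proposed, proposed_gap)
```

With these numbers the gap between the open and closed options goes from $1,500 to zero, which is the whole point of the scheme.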
I’m sure someone has already thought about this, but this is a blog so I get to write things like the above without bothering to do background research.
(Inspired by Francis Pinter’s video and the recent Guardian piece.)
Authorization and meaning
Larry,
I’m reviewing the issue-57 discussion at the recent TAG F2F and I notice that you hammer repeatedly against the connection between meaning and service:
- I don’t think it should cost energy to mean something
- I’m concerned [about] the way you’re describing the pattern – [dependency on running web server] … if you do a GET on the string, web server must be running when you do that GET.
and so on.
These IRC comments were ignored at the time. Now I will try to answer them.
If I ever talk about meaning being determined by GET, it is only a shorthand for a more nuanced story about authorization; I don’t really mean it. I’m sorry if that’s confusing.
My understanding of this from HTTPbis is that a “representation” (or other response), over whatever protocol (including any inter-brain protocol), is authorized for an http: URI by the domain name owner. That is, it is not correct for a cache or proxy to deliver a representation for a dereference of that URI that is not so authorized. The HTTP protocol is one way to express such an authorization, and because of Expires: the authorization can last for up to a year. But it does not take any energy for a representation to *be* authorized for a URI. The domain owner’s server can shut down completely, but copies cached in disks or on the sides of buses or inside brains can continue to be authorized.
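As a rough illustration of the Expires: point, here is a minimal sketch that treats a cached copy as still covered by the authorization while the current time is before its Expires date. The name still_authorized is hypothetical, and reducing authorization to a date comparison is a simplifying assumption, not something HTTPbis states in these terms.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def still_authorized(expires_header, now):
    """True if `now` is strictly before the date in an Expires header value."""
    expires = parsedate_to_datetime(expires_header)
    return now < expires


# A copy whose Expires date is 1 December 2011, checked before and after.
expires_value = "Thu, 01 Dec 2011 16:00:00 GMT"
before = still_authorized(expires_value, datetime(2011, 6, 1, tzinfo=timezone.utc))
after = still_authorized(expires_value, datetime(2012, 6, 1, tzinfo=timezone.utc))
print(before, after)
```

Note that the check needs no server: the copy and its Expires date are enough, wherever the copy happens to be stored.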
You might even be able to find out whether a representation is authorized by, say, calling the current domain owner on the phone (not sure whether HTTPbis allows this). And other URI schemes have their own ways for a representation to be authorized. For example, RFC 2397 and your duri: draft authorize representations for a whole bunch of URIs for a very long time (forever).
But regardless of the URI scheme, representation-related meaning is either cached (maybe in your brain), looked up, or calculated by an authorized formula (as in the case of data:), and while perhaps no energy is needed for meaning itself, there is no caching or lookup or calculation apparatus that does not require energy to maintain. (Of course you know this distinction, sorry if I seem to lecture.)
If there were a legitimate way to authorize an http: representation for a very long period of time, such as true domain name ownership, then maybe we wouldn’t have to worry so much about meanings in http: space changing every year. RFC 2397 seems to do a pretty good job of authorizing representations forever; but the future is inherently unpredictable, so even the meanings of data: and duri: URIs are uncertain. Perhaps their meanings will be redefined by HTML6, in order to accommodate the way deployed infrastructure understands them. … unless by “meaning” you mean duri:2011:word:meaning… oh wait…
Now I’m mostly with you on http: not necessarily being the best basis for civil discourse (which requires citations, quite separately from up-to-date links), and duri: being superior in many ways. But issue-57 is not the best place, in my opinion, to address persistence concerns, given choices that have already been made by the affected community. I think it belongs with issue-50.
(Footnote: by “authority” throughout I mean the nice kind, the kind that’s granted, not imposed.)
(Footnote: someday I’ll figure out a ZBAC way to analyze all this. Not there yet.)
Crossref’s gift of metadata
I was delighted to learn of Crossref’s April 20 announcement (press release; Geoff Bilder’s blog post) that they are making their DOI metadata available in RDF via HTTP. This is a significant development for scholarship on the Web and an important step toward a fully open and reliable scholarly edifice.
For those of you not familiar with this database, it has about 46 million records (and growing), keyed by strings called “digital object identifiers”. DOIs are similar to the ISBNs used for books, but are applied at a finer level of granularity – mainly for academic research articles published in the past 10 years, but with coverage steadily growing. Each record has basic bibliographic metadata for its “object” such as author, title, publisher, publication date. For an example try
curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1155/1974/82714"
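The same request can be sketched in Python with only the standard library. The function names build_metadata_request and fetch_metadata are made up for this example, and whether the service still answers this way at this address is not something the sketch checks.

```python
import urllib.request


def build_metadata_request(doi):
    """Build a GET request for a DOI that asks for Turtle via the Accept header."""
    url = "http://dx.doi.org/" + doi
    return urllib.request.Request(url, headers={"Accept": "text/turtle"})


def fetch_metadata(doi):
    """Send the request (urlopen follows redirects) and return the body as text."""
    with urllib.request.urlopen(build_metadata_request(doi)) as response:
        return response.read().decode("utf-8")


request = build_metadata_request("10.1155/1974/82714")
print(request.full_url, request.get_header("Accept"))
# fetch_metadata("10.1155/1974/82714") would perform the actual network call.
```

Building the request separately from sending it makes it easy to inspect the URL and headers without touching the network.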
(This Google 1-gram, although it only reflects occurrences of “DOI” in books, hints at the growing popularity of DOIs.)
The value of the database derives in large part from the strength of Crossref’s publisher rules, which help guarantee DOI uniqueness and metadata quality.
Open access to metadata is not as wonderful as open access to the content of the articles, but it’s an important toehold. For example, DOI metadata may be what enables an automated assistant to find a copy of an article in a library collection you have access to, or to find data sets or database accessions that come from it or refer to it.
Crossref’s announcement is much more important than your run of the mill open data announcement, for a variety of reasons. First, the data is central, since the literature is a hub for other kinds of information. This database describes the scholarly literature, the backbone of research. Nearly anything you want to say or record, as a scholar, either derives from the literature or uses it as evidence. DOI metadata helps make all kinds of statements more concise, rigorous, and machine-friendly.
Second, the data can be used by a wide variety of tools. Reference managers such as Mendeley and Zotero already access DOI metadata – I’m not sure how, possibly using the older password-protected OpenURL interface in some sneaky way – in that you can give them a DOI and they will automatically fill in author, title, and so on in a reference list. But now all sorts of other tools will be able to do the same sort of thing. I imagine Crossref’s service becoming standard in all sorts of annotation and social networking tools, database front ends, and so on. Rather than scraping this information from web pages, a tool can just find or accept a DOI, and obtain the metadata from Crossref.
Third, it suggests that we may be getting closer to bulk download and open mirrors for Crossref’s data. Such mirrors will be necessary not only for use in citation network research and integration into other databases, but also in order to protect the DOI system from attack. Given the international nature and inherent skepticism of the scholarly community, it is important that access to the metadata not be vulnerable to the administrative, technical, or legal failure of Crossref or its supporting infrastructure. Lots of copies of the database would mean protection against such failures. [removed sentence 5/5]
Fourth, this information is interesting enough that developers who have previously stayed away from RDF and LOD will now link in an RDF parser as a means to an end, not an end in itself. This ought to be a boost to the LOD world, which in my mind is dominated by solutions in search of a problem.
Fifth, it is very cool that Crossref is observing the “httpRange-14 resolution”, which in effect says that metadata applies to normal Web pages, by using HTTP 303 responses to flag a not-so-normal situation. The 303 ensures that the URI form of a DOI refers to the article itself, even if it’s behind a paywall, not to the landing page that you might arrive at when dereferencing the URI. Crossref could easily have taken the low road and kept the 302 redirects they were using before, but that would have led to confusion over whether the metadata applied to the landing page or to the article, and they had the wisdom to foresee this. This is a subtle point of Web architecture and I’m glad they got it.
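For readers who have not met the resolution, its three cases can be sketched as a small lookup on the HTTP status code of a GET response. The function name and the short labels are assumptions made for this example; the resolution itself is prose, not code, and it says nothing about statuses such as 302.

```python
def what_get_tells_you(status):
    """Map the status of a GET response to what the httpRange-14 resolution concludes."""
    if 200 <= status < 300:
        # 2xx: the URI identifies an information resource.
        return "information resource"
    if status == 303:
        # 303: the URI could identify any resource, e.g. an article rather than a page.
        return "any resource"
    if 400 <= status < 500:
        # 4xx: the nature of the resource is unknown.
        return "unknown"
    # Anything else (302 included) is not covered by the resolution.
    return "not covered"


for code in (200, 302, 303, 404):
    print(code, what_get_tells_you(code))
```

The 302 case is the one that matters here: it falls outside the resolution, which is why keeping the old redirects would have left the subject of the metadata unclear.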
It would be nice if the metadata pages asserted their own legal status, preferably using a CC0 waiver. This is probably not necessary since the information is factual and (IANAL) not protected by copyright law, but clarity is always welcome. This issue is endemic to all open data, so I will take it up another time and not single out Crossref.
Are you confused yet about the word “representation”?
Wordnet entry for “representation”.
S: (n) representation, mental representation, internal representation (a presentation to the mind in the form of an idea or image)
S: (n) representation (a creation that is a visual or tangible rendering of someone or something)
S: (n) representation (the act of representing; standing in for someone or some group and speaking with authority in their behalf)
S: (n) representation, delegacy, agency (the state of serving as an official and authorized delegate or agent)
S: (n) representation (a body of legislators that serve in behalf of some constituency) “a Congressional vacancy occurred in the representation from California”
S: (n) representation (a factual statement made by one party in order to induce another party to enter into a contract) “the sales contract contains several representations by the vendor”
S: (n) theatrical performance, theatrical, representation, histrionics (a performance of a play)
S: (n) representation (a statement of facts and reasons made in appealing or protesting) “certain representations were made concerning police brutality”
S: (n) representation (the right of being represented by delegates who have a voice in some legislative body)
S: (n) representation (an activity that stands as an equivalent of something or results in an equivalent)
Niklaus Wirth.
Program Development by Stepwise Refinement.
Communications of the ACM, Vol. 14, No. 4, April 1971, pp. 221-227.
In order to refine these instructions and predicates further in the direction of instructions and predicates available in common programming languages, it becomes necessary to express them in terms of data representable in those languages. A decision on how to represent the relevant facts in terms of data can therefore no longer be postponed.
Tim Berners-Lee.
Generic Resources.
Web page, 1996-2009.
A resource may be generic in that as a concept it is well specified but not so specifically specified that it can only be represented by a single bit stream.
T. Berners-Lee, R. Fielding, and H. Frystyk.
Hypertext Transfer Protocol — HTTP/1.0.
RFC 1945, IETF, May 1996.
A feature of HTTP is the typing of data representation, allowing systems to be built independently of the data being transferred.
R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee.
Hypertext Transfer Protocol — HTTP/1.1.
RFC 2616, IETF, June 1999.
A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.
Resources may be available in multiple representations (e.g. multiple languages, data formats, size, and resolutions).
representation [definition from glossary]:
An entity included with a response that is subject to content negotiation.
Tim Berners-Lee.
“The range of the HTTP dereference function.”
Email to www-tag list, March 2002.
HTTP is a protocol which provides, for the client, a mapping (the http URI dereference function) from URI starting with “http:” and not containing a “#” to a representation of a document. The document is the abstract thing and the representation is bits.
Roy T. Fielding and Richard N. Taylor.
Principled Design of the Modern Web Architecture.
ACM Transactions on Internet Technology (TOIT), 2002.
[JAR comments inserted]
… allowing a user to progress through the application by selecting a link or submitting a short data-entry form, with each action resulting in a transition to the next state of the application by transferring a representation of that state to the user.
[A representation is of a state.]
REST components communicate by transferring a representation of the data …
[A representation is of data.]
Finally, it allows an author to reference the concept rather than some singular representation of that concept, thus removing the need to change all existing links whenever the representation changes.
[A representation is of a concept.]
Depending on the message control data, a given representation may indicate the current state of the requested resource, the desired state for the requested resource, or the value of some other resource, such as a representation of the input data within a client’s query form, or a representation of some error condition for a response.
[A representation indicates a state.]
… the specification of Web addresses also defines the scope and semantics of what we mean by resource, which has changed since the early Web architecture. REST was used to define the term resource for the URI standard [Berners-Lee et al. 1998], as well as the overall semantics of the generic interface for manipulating resources via their representations.
[A resource can be manipulated via a machine interface.]
A resource does not always map to a singular file, but all resources that are not static are derived from some other resources, and by following the derivation tree an author can eventually find all of the source resources that must be edited in order to modify the representation of a resource. [emphasis JAR's]
[A resource is derived from editable sources.]
Semantics are a byproduct of the act of assigning resource identifiers and populating those resources with representations. At no time whatsoever do the server or client software need to know or understand the meaning of a URI — they merely act as a conduit through which the creator of a resource (a human naming authority) can associate representations with the semantics identified by the URI. In other words, there are no resources on the server; just mechanisms that supply answers across an abstract interface defined by resources. [emphasis JAR's]
[Resources do not reside on servers.]
Ian Jacobs and Norman Walsh, editors.
Architecture of the World Wide Web, Volume One.
W3C Recommendation, December 2004.
[Document approved for advancement to Proposed Recommendation by Roy Fielding and the rest of the TAG.]
[Glossary] A representation is data that encodes information about resource state. Representations do not necessarily describe the resource, or portray a likeness of the resource, or represent the resource in other senses of the word “represent”.
[1] In this travel scenario, the resource is a periodically updated report on the weather in Oaxaca…
[2.2] In the case of this document, the message payload is the representation of this document.
Tim Berners-Lee, Roy Fielding, and Larry Masinter.
Uniform Resource Identifier (URI): Generic Syntax.
RFC 3986, IETF, January 2005.
When URIs are used within information retrieval systems to identify sources of information, the most common form of URI dereference is “retrieval”: making use of a URI in order to retrieve a representation of its associated resource. A “representation” is a sequence of octets, along with representation metadata describing those octets, that constitutes a record of the state of the resource at the time when the representation is generated.
Tim Berners-Lee.
The meaning of “representation”.
Email to www-tag list, November 2007.
In fact, the relationship includes social as well as technical aspects. It also is defined, often, by high-level protocols. These higher level protocols set common expectations between the publisher and the reader. These are not consistent across the web. That is why you can’t simplistically just give a formula for that relationship.
…There is information in it in RDF which says that I made it and it is my public profile document.
…These expectations are very important to the web working. That’s why you can’t just write down a formula for the constraint on the representations of a given resource.
Roy Fielding.
“Some notes on organizing discussion on WebApps architecture.”
Email to www-tag list, October 2010.
On Oct 14, 2010, at 2:11 PM, Larry Masinter wrote:
Well, I wonder if we might introduce another step between “resource” and “representation” which is “application resource in identified state”, so that the representation isn’t a representation of the resource, but a representation of the resource in that state.
Umm, what? That would be terribly confusing and contrary to why I used the term representation in the first place (it is a representation by the origin server to the recipient of the state of that identified resource at the time of message generation).
The place of metadata
I am entering this picture into the blog mainly so I can refer to it if I need to.
The problem is that people often get confused about these relationships. Maybe I am confused and it’s catching, but I have no evidence of this.
Some people like to talk about “the metadata for a bridge”. That really bothers me. Metadata in common usage (outside of circles of confused informatics weenies) means information about information. Look it up. So what’s really meant is “a description of a bridge” or “data about a bridge”.
Some people like to say that all information resources (document-like things) are descriptions. They aren’t. A symphony isn’t a description, nor is a word list.
The main message is the center axis but the side classes are added to document what another subclass choice would be in each case.
I don’t know much UML but I have picked up one habit, which is to use this kind of arrow (the one whose head looks like a triangle) to mean “is a subclass of” or “is a”. If it were a relation like others you see in diagrams like this where the nodes are classes, the relation would be “equals” where some of the things in the target are not equal to anything in the source. Using a different kind of arrowhead emphasizes the difference in kind. Relation links to me mean “some of these are [label]-related to some of those.” I could use UML or ER detailing (double-arrow and so on) to say whether it’s some or all in each case and whether the relationship is functional or inverse functional, but I don’t because I can never remember what those funny annotations mean.
To relate to the FRBR discussion, a BR (bibliographic record) would be one kind of metadata record.

