Crossref’s gift of metadata
I was delighted to learn of Crossref’s April 20 announcement (press release; Geoff Bilder’s blog post) that they are making their DOI metadata available in RDF via HTTP. This is a significant development for scholarship on the Web and an important step toward a fully open and reliable scholarly edifice.
For those of you not familiar with this database, it has about 46 million records (and growing), keyed by strings called “digital object identifiers”. DOIs are similar to the ISBNs used for books, but are applied at a finer level of granularity – mainly for academic research articles published in the past 10 years, but with coverage steadily growing. Each record has basic bibliographic metadata for its “object”, such as author, title, publisher, and publication date. For an example, try
curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1155/1974/82714"
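For readers who would rather script this than use curl, here is a minimal Python sketch of the same content negotiation, using only the standard library. The function names `build_metadata_request` and `fetch_metadata` are my own, not part of any Crossref API, and the UTF-8 decoding is an assumption about the response encoding.

```python
import urllib.request
from urllib.parse import quote

# Resolver prefix taken from the curl example; everything else is illustrative.
DOI_RESOLVER = "http://dx.doi.org/"


def build_metadata_request(doi, media_type="text/turtle"):
    """Build (but do not send) a request that asks the resolver for RDF."""
    url = DOI_RESOLVER + quote(doi, safe="/")
    # The Accept header is what selects Turtle instead of the HTML landing page.
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_metadata(doi, media_type="text/turtle", timeout=10):
    """Send the request and return the body as text.

    urllib follows redirects by default, much like curl's -L flag.
    Decoding as UTF-8 is an assumption, not something the service promises.
    """
    request = build_metadata_request(doi, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metadata("10.1155/1974/82714")` should return roughly what the curl command prints, minus the response headers that `-D -` dumps.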
(This Google 1-gram, although it only reflects occurrences of “DOI” in books, hints at the growing popularity of DOIs.)
The value of the database derives in large part from the strength of Crossref’s publisher rules, which help guarantee DOI uniqueness and metadata quality.
Open access to metadata is not as wonderful as open access to the content of the articles, but it’s an important toehold. For example, DOI metadata may be what enables an automated assistant to find a copy of an article in a library collection you have access to, or to find data sets or database accessions that come from it or refer to it.
Crossref’s announcement is much more important than your run-of-the-mill open data announcement, for a variety of reasons. First, the data is central, since the literature is a hub for other kinds of information. This database describes the scholarly literature, the backbone of research. Nearly anything you want to say or record, as a scholar, either derives from the literature or uses it as evidence. DOI metadata helps make all kinds of statements more concise, rigorous, and machine-friendly.
Second, the data can be used by a wide variety of tools. Reference managers such as Mendeley and Zotero already access DOI metadata – I’m not sure how, possibly using the older password-protected OpenURL interface in some sneaky way – in that you can give them a DOI and they will automatically fill in author, title, and so on in a reference list. But now all sorts of other tools will be able to do the same sort of thing. I imagine Crossref’s service becoming standard in all sorts of annotation and social networking tools, database front ends, and so on. Rather than scraping this information from web pages, a tool can just find or accept a DOI, and obtain the metadata from Crossref.
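To make the "find or accept a DOI" step concrete, here is a rough Python sketch of how a tool might spot DOIs in free text and turn them into resolvable URIs. The regular expression is a common heuristic rather than an official grammar, the trailing-punctuation stripping is my own assumption (it would mangle a DOI that genuinely ends in a parenthesis), and the names `find_dois` and `doi_to_uri` are hypothetical.

```python
import re

# Heuristic only: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return DOI-like strings found in text, in order of appearance."""
    # Sentence punctuation often clings to the end of a DOI; strip it.
    return [m.group(0).rstrip(".,;:)") for m in DOI_PATTERN.finditer(text)]


def doi_to_uri(doi):
    """Turn a bare DOI into the HTTP URI form used elsewhere in this post."""
    return "http://dx.doi.org/" + doi
```

A tool would then request each URI with an RDF `Accept` header, as in the curl example earlier, instead of scraping the publisher's page.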
Third, it suggests that we may be getting closer to bulk download and open mirrors for Crossref’s data. Such mirrors will be necessary not only for use in citation network research and integration into other databases, but also in order to protect the DOI system from attack. Given the international nature and inherent skepticism of the scholarly community, it is important that access to the metadata not be vulnerable to the administrative, technical, or legal failure of Crossref or its supporting infrastructure. Lots of copies of the database would mean protection against such failures.
Fourth, this information is interesting enough that developers who have previously stayed away from RDF and LOD will now link in an RDF parser as a means to an end, not an end in itself. This ought to be a boost to the LOD world, which in my mind is dominated by solutions in search of a problem.
Fifth, it is very cool that Crossref is observing the “httpRange-14 resolution”, which in effect says that metadata applies to normal Web pages, by using HTTP 303 responses to flag a not-so-normal situation. The 303 ensures that the URI form of a DOI refers to the article itself, even if it’s behind a paywall, not to the landing page that you might arrive at when dereferencing the URI. Crossref could easily have taken the low road and kept the 302 redirects they were using before, but that would have led to confusion over whether the metadata applied to the landing page or to the article, and they had the wisdom to foresee this. This is a subtle point of Web architecture and I’m glad they got it.
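For the programmers in the audience, the distinction can be written down as a small lookup. This is only my paraphrase of the httpRange-14 resolution, not a normative rendering: the function name and the returned labels are my own, and a 302 falls into the "not addressed" bucket, which is exactly the ambiguity described above.

```python
def interpret_response(status):
    """Rough reading of what a GET response says about the requested URI."""
    if 200 <= status < 300:
        # A direct success: the URI names a Web document (information resource).
        return "information resource"
    if status == 303:
        # See Other: the URI may name anything, e.g. the article itself;
        # the Location header points at a description of it.
        return "any resource, described at Location"
    if 400 <= status < 500:
        # Client errors leave the nature of the resource unknown.
        return "unknown"
    # Other codes, including 302, are not covered by the resolution.
    return "not addressed"
```

Under this reading, `interpret_response(303)` is the only case that tells a client the metadata is about the thing named by the DOI rather than about whatever page it eventually lands on.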
It would be nice if the metadata pages asserted their own legal status, preferably using a CC0 waiver. This is probably not necessary since the information is factual and (IANAL) not protected by copyright law, but clarity is always welcome. This issue is endemic to all open data, so I will take it up another time and not single out Crossref.