Archive for February, 2011

The place of metadata

I am entering this picture into the blog mainly so I can refer to it if I need to.

Relationships between thing, IR, description, metadata

The problem is that people often get confused about these relationships. Maybe I am confused and it’s catching, but I have no evidence of this.

Some people like to talk about “the metadata for a bridge”.  That really bothers me. Metadata in common usage (outside of circles of confused informatics weenies) means information about information. Look it up. So what’s really meant is “a description of a bridge” or “data about a bridge”.

Some people like to say that all information resources (document-like things) are descriptions. They aren’t. A symphony isn’t a description, nor is a word list.

The main message is the center axis; the side classes are there to document what an alternative subclass choice would be in each case.

I don’t know much UML but I have picked up one habit, which is to use this kind of arrow (the one whose head looks like a triangle) to mean “is a subclass of” or “is a”. If it were a relation like others you see in diagrams like this where the nodes are classes, the relation would be “equals” where some of the things in the target are not equal to anything in the source. Using a different kind of arrowhead emphasizes the difference in kind. Relation links to me mean “some of these are [label]-related to some of those.” I could use UML or ER detailing (double-arrow and so on) to say whether it’s some or all in each case and whether the relationship is functional or inverse functional, but I don’t because I can never remember what those funny annotations mean.

To relate to the FRBR discussion, a BR (bibliographic record) would be one kind of metadata record.

Categories: Uncategorized

A blowfly reflects

Data is (are) the prey and sustenance of science.  With vigilance, cunning, and skill, the lionesses detect and subdue it.  The alpha males get first access to the kill (and, strangely, credit for it), followed by the females and cubs. When they cannot be kept at bay any longer, the hyenas get to continue the lions’ work.  At that point, if we’re lucky, the remaining carcass is left to open access, and we vultures and blowflies can do with it what we will.


FRBR and the Web

I’m going to assume some familiarity with FRBR, so if you want to read the below and don’t know FRBR, you might want to consult the FRBR specification.

The idea of FRBR is that when organizing bibliographic information into database records it’s useful to group the kinds of things you would typically say into four groups.  If you have a physical book, you might talk about (1) attributes specific to that physical copy, such as where it is and what condition it’s in; (2) attributes shared by other copies in its print run or edition; (3) attributes shared with other editions such as title, date, and author; and (4) attributes shared with translations and adaptations, such as intended audience or form (e.g. novel vs. poem).  If you organize your database this way you will have four different bibliographic records applicable to the book.  Overall the system’s records will be arranged in a forest, so that for each more-specific record there is exactly one less-specific record.
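To make the "forest" shape concrete, here is a minimal sketch in Python. The `Record` class, its field names, and the attribute values are all made up for illustration; nothing here comes from the FRBR specification beyond the four level names and the rule that each more-specific record has exactly one less-specific parent.

```python
from dataclasses import dataclass
from typing import Optional

# Least specific to most specific; groups (4), (3), (2), (1) in the text above.
LEVELS = ("Work", "Expression", "Manifestation", "Item")


@dataclass
class Record:
    """A hypothetical bibliographic record at one FRBR level."""
    level: str
    attributes: dict
    parent: Optional["Record"] = None  # the single less-specific record

    def __post_init__(self):
        depth = LEVELS.index(self.level)  # raises ValueError for an unknown level
        expected = LEVELS[depth - 1] if depth > 0 else None
        actual = self.parent.level if self.parent is not None else None
        if actual != expected:
            raise ValueError(
                f"{self.level} record needs parent level {expected}, got {actual}"
            )

    def applicable_records(self):
        """Return every record that applies, most specific first."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain


# Illustrative values only.
work = Record("Work", {"form": "novel"})
expression = Record("Expression", {"title": "An Example Title"}, parent=work)
manifestation = Record("Manifestation", {"edition": "second"}, parent=expression)
my_copy = Record("Item", {"location": "shelf 3", "condition": "worn"},
                 parent=manifestation)

print([r.level for r in my_copy.applicable_records()])
# prints ['Item', 'Manifestation', 'Expression', 'Work']
```

Because every record has at most one parent and Work records have none, the records form a forest with Works at the roots, and any physical copy picks out exactly four applicable records.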

Now the description of FRBR says that each record “is about” or “describes” some “entity”, an ontologically dubious proposition.  I would say FRBR entities are theoretical inventions, abstractions, or fictions, depending on my mood, and from an ontological point of view it’s not clear to me that saying the records are about things other than physical items is the most helpful way to think about the enterprise. But it’s not particularly harmful either.

For someone familiar with description logic (DL) a different framework presents itself based on these “functional requirements”.  You could say that your domain of discourse consists of FRBR Items, physical carriers of information.  If you state a set of properties (such as a number of properties at a single FRBR level) you have defined a class of Items having those properties, so any of the three other kinds of record is in effect a class definition.  An “embodies” or “realizes” relation between FRBR “entities” is really a subclass relation between classes of Items.
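A rough sketch of the class-definition reading, in Python rather than in any real DL syntax. The helper names (`make_item`, `record_class`, `extension`) and the property values are invented for illustration, and the sketch assumes a more-specific record restates the properties of the less-specific one and adds its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A physical carrier of information, described by property/value pairs."""
    properties: frozenset


def make_item(**props):
    return Item(frozenset(props.items()))


def record_class(**required):
    """Read a record as a class definition: a predicate that is true of
    every Item having all of the stated properties."""
    needed = frozenset(required.items())
    return lambda item: needed <= item.properties


def extension(cls, domain):
    """The set of Items in the domain that belong to the class."""
    return {item for item in domain if cls(item)}


# The domain of discourse: Items only (illustrative values).
domain = {
    make_item(title="An Example Title", edition="first", location="shelf 3"),
    make_item(title="An Example Title", edition="second", location="shelf 4"),
    make_item(title="Another Title", edition="first", location="shelf 5"),
}

expression_class = record_class(title="An Example Title")
manifestation_class = record_class(title="An Example Title", edition="second")

# "Embodies" shows up as a subclass (subset) relation between classes of Items.
print(extension(manifestation_class, domain) <= extension(expression_class, domain))
# prints True
```

The subset relation holds here by construction, since the Manifestation-level class requires everything the Expression-level class requires plus one more property.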

However the two formulations are interdefinable.  Forgive me if I switch back and forth.

This brings me to the Web.  If I write an article in my word processor, and save it to disk, there are two Items, the copy in main memory and the copy on the disk.  Put it on a Web server and many more Items (copies) can be made, but (in FRBR terms) they all “exemplify” a single Manifestation.  In DL terms you’d say they share Manifestation-level properties, and all things that have those properties form the class corresponding to the Manifestation.  In the HTTP protocol many copies of the source Item are made as the information passes through network buffers, proxies, caches, and application libraries, but they all still exemplify the same Manifestation.  So it seems safe to say that any single GET/200 exchange is associated with a single Manifestation, at least when the response has any resemblance to something you’d be concerned with in a “bibliographic record”.

Now if you do multiple HTTP GETs of the same URI, you may always see the same Manifestation, in which case you might say there’s a special relationship between that URI and the Manifestation.  If on the other hand, there is variation between the Manifestations you get but they all embody a single Expression (for example, you get the same words encoded differently – say text/plain and text/html), then there would seem to be a special relationship between that URI and the Expression.  Similarly for Work – content negotiation could get you Expressions in different languages.

Whether there is response variation, and how widely it extends, is impossible to determine using HTTP GET alone (even with the help of Vary: headers), so the truth value of these ‘cozy’ relationships is unknowable for Popperesque reasons.  I may issue a thousand requests and always get the same Manifestation, but get a different one on request number 1001.  So if I say that a URI is cozy with a Manifestation, I had better either have inside knowledge of how that web server is configured (e.g. I could be the one running it), or else I need to be prepared to be proven wrong.
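The "prepared to be proven wrong" point can be sketched as a small function over observations. This assumes someone has already judged, for each GET/200 response, which Manifestation, Expression, and Work it belongs to (a cataloguing judgment, not something the code computes). The `Observation` type, the function name, and the labels such as `m1` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Observation(NamedTuple):
    """Labels assigned by hand to one GET/200 response from a given URI."""
    manifestation: str
    expression: str
    work: str


def narrowest_consistent_level(observations: Sequence[Observation]) -> Optional[str]:
    """Return the most specific FRBR level on which all observations so far
    agree, or None if they agree on none. This is only a conjecture about
    the URI: one more observation can always overturn it."""
    if not observations:
        return None
    for level in ("manifestation", "expression", "work"):
        if len({getattr(obs, level) for obs in observations}) == 1:
            return level.capitalize()
    return None


# A thousand identical responses suggest the URI is cozy with a Manifestation...
seen = [Observation("m1", "e1", "w1")] * 1000
print(narrowest_consistent_level(seen))  # prints Manifestation

# ...until request number 1001 returns the same words in another encoding.
seen.append(Observation("m2", "e1", "w1"))
print(narrowest_consistent_level(seen))  # prints Expression
```

A result of `None` means the observations so far rule out a cozy relationship at every level; any other result is just the best surviving hypothesis.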

For some URIs there is no such cozy relationship of the URI to any FRBR entity.  There is always the aggregate collection of all the received Manifestations (or Expressions or Works), but the relationship between the Manifestations you get and the aggregate is different, part-of rather than embodies (etc.).  For example, different blog posts can come from the same URI at different times, but they are parts of the blog, not embodiments (realizations, etc.) of it.  FRBR would describe the blog posts and the blog very differently.

The Manifestation “cloud” around some URIs seems to fall outside of FRBR entirely.  Consider a page inside a bank’s web site for current account balances.  From a single URI, different users get different pages depending on their session authentication.  This is not really a serial publication, as pages might be issued simultaneously, and it is not really a collection since the pages aren’t collected together in one place.  So the URI does not seem to be ‘cozy’ with any FRBR entity.  Hoping that FRBR will be helpful in understanding all Web URIs is probably too much to ask.

We can connect some of this to web architecture rhetoric.  The webarch theory holds that there is something that the URI “identifies” and that it is an “information resource” that has “representations”.  It seems consistent to say that a “representation” is close to what FRBR calls a Manifestation, and that FRBR Manifestations, Expressions, and Works can all be “information resources” since any of the three can cozy up to a URI.  (Compare TimBL’s “Generic Resources” note.)  Undoubtedly there are other “information resources” as well, but they may either correspond to FRBR aggregates, swapping “exemplifies” and so on for “part of” in the way “representation of” works, or they may be undescribable using FRBR.

[2/13: Alan R points out that according to FRBR not all Expressions have Manifestations.]

In the “semantic web” world, where a URI is used not just with HTTP but as a name that refers to something, one could use a Web URI to refer to the Manifestation you get, its Expression, or the Expression’s Work, depending on the range of Manifestations that you get from GET/200 (and given the limitations described above).  This is where the DL Item-class approach wins, because it lets you ascribe properties to “information resources” without having to commit to any particular level of the FRBR hierarchy, and thus without having to be conscious of the FRBR record modularity.  You can write down an Expression-level property, and then if you later find it’s better to use the URI to refer to a Manifestation that’s not a problem.  Of course the other direction doesn’t work so well. (I’m pulling a fast one here – I need to write another blog post on generic individuals.)
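The asymmetry between the two directions can be illustrated with a toy example, assuming a property ascribed to a class means "true of every Item in the class". The dictionaries and the `holds_for_all` helper are made up for illustration.

```python
# Copies (Items) received from one URI; illustrative values only.
copies = [
    {"language": "en", "media_type": "text/html"},
    {"language": "en", "media_type": "text/html"},
    {"language": "en", "media_type": "text/plain"},
]

# The Expression-level class: all copies with the same words.
expression_items = copies
# A Manifestation-level subclass: only the text/html copies.
manifestation_items = [c for c in copies if c["media_type"] == "text/html"]


def holds_for_all(key, value, items):
    """True if every Item in the class has the given property value."""
    return all(item[key] == value for item in items)


# An Expression-level property survives narrowing to the Manifestation.
print(holds_for_all("language", "en", expression_items))     # prints True
print(holds_for_all("language", "en", manifestation_items))  # prints True

# A Manifestation-level property does not survive widening to the Expression.
print(holds_for_all("media_type", "text/html", manifestation_items))  # prints True
print(holds_for_all("media_type", "text/html", expression_items))     # prints False
```

Anything true of every member of a class stays true of every member of a subclass, which is why moving the URI's referent from Expression down to Manifestation is harmless while moving it up is not.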

For a more sophisticated picture of how the Web relates to documents, see Henry Thompson’s article on URIs.

I’m surely not the first to think about all this, but I got tired of research after two Google searches.

Thanks to Allen Renear for prompting these thoughts and being interested, Tim Danford for “cryptic class”, and Alan Ruttenberg for null hypotheses.
