Decision time


So, it was decision time today. What shall our publicly published identifiers look like? For real, this time.

At work, we expose quite a bit of data to the web, especially to “The Semantic Web” (henceforth: ‘semweb’). The basic notion is that the semweb as a whole is a single, vast, extended triple-store with an object model on top of it. Identifiers identify atomic things, and these things have attributes with (possibly multivalued) values. The values may be either literals or other objects.

The identifiers in this system are URIs. URIs are opaque and don’t mean anything. However, in the semantic web of linked data, there is a convention: identifiers should be http URLs, and you should be able to perform an HTTP GET on them. If you do this, some service will spit back a 303 redirect to a URL which will give you – well – something meaningful about the URI. In practice, you get an HTML page or an RDF (or even JSON) document, depending on the HTTP headers of the request (typically Accept).
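A minimal sketch of that convention in Python, to make the 303-plus-headers dance concrete. The media type table, the ".html/.rdf/.json" document URLs and the example.org identifier are all assumptions made up for illustration; they are not how any particular service does it, and the Accept parsing ignores q-values.

```python
# Hypothetical table of supported media types and the document suffix for each.
FORMATS = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "application/json": "json",
}


def pick_format(accept_header, default="html"):
    """Return the first supported format named in an Accept header.

    Simplified: walks the header in order and ignores q-values.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in FORMATS:
            return FORMATS[media_type]
    return default


def see_other(identifier_uri, accept_header):
    """Build the (status, headers) pair for a 303 See Other response."""
    fmt = pick_format(accept_header)
    # Hypothetical document URL scheme: identifier plus a format suffix.
    location = f"{identifier_uri}.{fmt}"
    return 303, {"Location": location, "Vary": "Accept"}


status, headers = see_other(
    "http://example.org/thing/1", "application/rdf+xml, text/html;q=0.8"
)
print(status, headers["Location"])
```

The point is only that the identifier itself never returns a document: it answers with a pointer to one, and which one depends on what the client asked for.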

In our system, the identified things have a three-part compound key.

They have a namespace. This namespace corresponds to a ‘shard’ of our system. A shard is both a collection of objects that someone is responsible for and an instance of our services running over the top of some particular database instance. Currently, we have an APNI (Australian Plant Name Index) shard and an AusMoss shard.

They have an object type. We expose names (that is: scientific and other names of things biological), references, instances of names at a reference, and classifications of instances (ie: instances that someone has organised into a tree structure).

And, of course, they have an id. As it happens, the id is unique within the namespace.

(Physically, these three parts are schema, tablename, and rownum. We have a single id sequence for the schema.)

Up till now, we have been exposing URIs that look like this:

http://biodiversity.org.au/boa/name/apni/54321
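As a rough sketch, the three-part key can be read back out of a URI of this shape by taking the last three path segments (object type, namespace, id, in that order). The `ObjectKey` class and `parse_identifier` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ObjectKey:
    namespace: str    # the shard, e.g. "apni"
    object_type: str  # e.g. "name"
    id: int           # unique within the namespace


def parse_identifier(uri):
    """Split an identifier URI into its key, using the last three path segments."""
    segments = [s for s in urlparse(uri).path.split("/") if s]
    if len(segments) < 3:
        raise ValueError(f"not an identifier URI: {uri}")
    object_type, namespace, raw_id = segments[-3:]
    return ObjectKey(namespace=namespace, object_type=object_type, id=int(raw_id))


print(parse_identifier("http://biodiversity.org.au/boa/name/apni/54321"))
```

Because only the trailing segments are read, any prefix in front of them (such as ‘boa’) is simply ignored.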

When one of these URIs comes into the site, it is sent to the linked-data “mapper”, whose job is to reply to this URI with a URL that will get to our service engine. For instance, the mapper could (it doesn’t, but it could) respond with a redirect to:

http://cluster5.biodiversity.org.au/getobject?type=name&id=54321

The mapper’s job, in part, is to “know” that if the namespace is ‘apni’, then that is served up by the server sitting in cluster 5. This may change, and that’s ok. What matters is that the URI doesn’t.
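Here is a small sketch of that lookup, using the made-up redirect target from above. The `SHARD_HOSTS` table and the `map_to_service` function are assumptions for illustration only; the real mapper does not respond with these URLs.

```python
from urllib.parse import urlencode

# Hypothetical routing table: namespace -> host currently serving that shard.
SHARD_HOSTS = {
    "apni": "cluster5.biodiversity.org.au",
}


def map_to_service(namespace, object_type, object_id):
    """Return the service URL that a redirect for this key would point at."""
    host = SHARD_HOSTS[namespace]  # raises KeyError for an unknown namespace
    query = urlencode({"type": object_type, "id": object_id})
    return f"http://{host}/getobject?{query}"


print(map_to_service("apni", "name", 54321))
```

Moving the shard to another cluster means changing one entry in the table. The published URI is untouched.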

Problem is, we don’t like our URIs. The ‘boa’ part means nothing. We could remove it, but this plays havoc with the reverse proxy. It means that every object type needs to be put into the Oracle reverse proxy config, and it means that we can’t manage our URL namespace freely when it comes to deploying services. And there’s a variety of other problems that were hashed out over the afternoon.

Eventually, the solution we chose was this:

http://id.biodiversity.org.au/name/apni/54321

This has a variety of advantages.

Although URIs are in theory opaque, of course they are not in practice. People who look at a URL would like to have some notion of what they are going to get back before they click the link. These URIs do that. You can read it. It’s an id. Of a name. Belonging to apni.

We can proxy all things that might be linked data URIs through to our linked data service without having to worry any further about it at the firewall. Our semantic web portal has its own little world in which it can do whatever it likes, and that’s fine.

We can rearrange our web pages and deploy URL prefix spaces at biodiversity.org.au freely, without the possibility that we might need to reorganise everything should we – for instance – decide to publish identifiers for specimens. To put it another way: the biodiversity.org.au domain was doing two jobs: it was a namespace for identifiers, and it was also a working website with static pages, webapps, and so on. Now it only does the one job.

Because it has its own domain name, the linked data service can easily be detached and moved into the cloud. The service is very simple, but works with a huge amount of data. Hosting it separately on an entirely different IP and site may in future make a lot of sense.

And finally, this is a decent general solution. If you run somefrogs.org and want to expose linked data, id.somefrogs.org makes perfect sense. If everyone used an ‘id.X’ subdomain for linked data identifiers minted by organisation X, I think that would be swell.

In fact, with a bit of fiddling about you could point your id.somefrogs.org subdomain at our mapper in the cloud, and we could host your ids. Because it’s a subdomain, your website would not need any changes. We’d agree on a namespace id for your stuff, you’d explain how we convert your object type/id pairs into URLs, and off we go.
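To sketch what that agreement might boil down to: a registration per (id host, namespace) pair, holding a URL template for turning object type/id pairs into URLs. Everything here is hypothetical, including the ‘frogs’ namespace, the template and the `resolve_hosted` function.

```python
from urllib.parse import urlparse

# Hypothetical registrations: (id host, namespace) -> URL template agreed with the owner.
HOSTED = {
    ("id.somefrogs.org", "frogs"): "http://www.somefrogs.org/view/{object_type}/{id}",
}


def resolve_hosted(uri):
    """Return the owner's URL for a hosted identifier, or None if it is not registered."""
    parsed = urlparse(uri)
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) != 3:
        return None
    object_type, namespace, object_id = segments
    template = HOSTED.get((parsed.hostname, namespace))
    if template is None:
        return None
    return template.format(object_type=object_type, id=object_id)


print(resolve_hosted("http://id.somefrogs.org/species/frogs/42"))
```

The mapper only needs the Host header and the path to pick the right template, which is why pointing a subdomain at it is enough.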

Anyway. It’s what we are doing.

One Response to Decision time

  1. .ghw says:

    So … why isn’t this write-up in the project documentation.
