I probably shouldn’t tell you this, but I have a test instance of d2rq running against a test copy of the new database at http://biodiversity.org.au/d2rq.
I’m finding that whenever d2rq terminates a request due to a timeout, it also seems to close the database connection, or something. I’m not sure it manages transactions properly when it’s in a web container. Perhaps I need to give it different connection parameters – tell it to use JNDI, perhaps. Another problem is that parts of its config measure time in seconds, while other parts measure it in milliseconds. I personally have believed ever since 2002 that any variable holding a physical quantity must have a name suffixed with the unit. It would have saved that Mars mission.
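To illustrate the unit-suffix habit, here is a minimal sketch in Python. The setting names are made up for illustration and are not d2rq's actual configuration keys; the point is only that the unit lives in the name, so a seconds value can't silently be read as milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutSettings:
    # Hypothetical settings: every field name carries its unit.
    query_timeout_seconds: float
    connection_idle_timeout_ms: int

    def query_timeout_ms(self) -> int:
        # The conversion is explicit and happens in exactly one place.
        return int(self.query_timeout_seconds * 1000)


settings = TimeoutSettings(query_timeout_seconds=30, connection_idle_timeout_ms=60000)
print(settings.query_timeout_ms())
```

Anyone reading `query_timeout_seconds` next to `connection_idle_timeout_ms` can see at a glance that the two are not directly comparable.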
By the time you read this, it has probably already crashed and will need to be bounced tomorrow. But hopefully not.
Most of my work over the past few weeks has been to build a d2rq mapping for our data, and to write a corresponding OWL vocabulary. I have not attempted to use existing vocabularies and fit them to what we are doing, instead opting to write a self-contained vocabulary. For terms that – hopefully – map to common things pertaining generally to taxonomy, see http://biodiversity.org.au/voc/boa/BOA.rdf, and the other files alongside it: Name.rdf, Author.rdf, Reference.rdf, Instance.rdf. Terms that are more specific to our National Species List application are in http://biodiversity.org.au/voc/nsl/NSL.rdf.
Naturally, it still needs cleaning up. Consider http://biodiversity.org.au/voc/boa/Name#Rank-visible. This boolean means “ranks like this need to be shown in the name” – e.g., ‘var’. Does this belong in the boa vocabulary? Or is it an artifact of how we produce name strings, belonging in the nsl vocabulary? I don’t know. To a degree, it doesn’t really matter – the main thing is the overall organisation of the data as a model of how the business of taxonomy gets done, and the persistence of the URIs for the individual objects.
Before I continue on to what I actually wanted to write about, I see I need to justify not using existing vocabularies.
First: God knows, we tried. Taxon Concept Schema, OAI-PMH, the TDWG ontology, SKOS, LSIDs – it’s all in there in the current stuff. But there’s endless quibbling about whether or not what SKOS means by a ‘concept’ is really the same as a name or an instance of a name (aka: not all instances are taxon concepts), or whether names are really names in the sense intended by the vocabulary you are trying to use, or if some of them are, and if so which ones (An illegitimate name is a name, but is it really a name? Those kinds of discussions.). Multiply this by the total number of terms you are trying to borrow. Each vocabulary presents a list of ranks, of nomenclatural statuses and what have you, but those lists never quite match what we have. 80% are fine.
The underlying cause of this is that taxonomists – you won’t believe this – just make stuff up! That’s why there isn’t a single set of authoritative nomenclatural statuses. Oh, there almost is (he said, laughing bitterly) – but there’s always one or two that don’t quite fit.
The thing is: they’re not doing it for fun, or to be difficult. They are handling the material they have, which varies from place to place. They generate vocabulary because they have to.
Indeed – they have exactly the problem that the IT people have: sometimes, there is no existing term that wouldn’t be wrong. So every site winds up with idiosyncrasies that the computers must handle.
But you are always having to invent extra vocabulary to make things fit properly. We tried to use TCS and wound up putting chunks of information in the ‘custom data’ section (before giving up on TCS altogether because the schema is so tight that it’s very difficult to generate a TCS document that is correct).
The solution we are going with is just to expose the data we have, with a vocabulary that we publish at the site (currently full of “insert description here” descriptions), and to offload the job of determining whether what we mean is what a querent is asking about onto the querent.
Another job I’ve been meaning to do – demonstrate how to set up an empty instance of Fuseki, configure it to talk to our SPARQL endpoint and the dbpedia endpoint (to take the union of the graphs), and write a SPARQL query that finds macropods in NSW (according to our data) that have a conservation status of endangered according to dbpedia.
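As a rough idea of what such a query might look like, here is a sketch in Python that assembles a SPARQL 1.1 federated query using two SERVICE clauses. Almost everything in it is an assumption: the path of our endpoint, every `ex:` predicate (placeholders, not terms from the BOA or NSL vocabularies), and the dbpedia property names and the "EN" status value. It also joins on the plain name string, which in practice would probably trip over language tags and formatting differences.

```python
from typing import Sequence

# ASSUMPTION: hypothetical path for our endpoint; the real one may differ.
OUR_ENDPOINT = "http://biodiversity.org.au/d2rq/sparql"
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# ASSUMPTION: ex: is a placeholder namespace, not our published vocabulary.
PREFIXES = """\
PREFIX ex: <http://example.org/placeholder#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
"""


def service_block(endpoint_url: str, patterns: Sequence[str]) -> str:
    """Wrap triple patterns in a SERVICE clause for one remote endpoint."""
    body = "\n".join("    " + p for p in patterns)
    return "  SERVICE <" + endpoint_url + "> {\n" + body + "\n  }"


def build_macropod_query() -> str:
    # Our side: macropod names recorded in NSW (placeholder predicates).
    ours = service_block(OUR_ENDPOINT, [
        '?name ex:family "Macropodidae" .',
        '?name ex:occursIn "NSW" .',
        "?name ex:simpleName ?binomial .",
    ])
    # dbpedia side: species with the same name string, flagged endangered.
    # ASSUMPTION: these property names and the "EN" value are illustrative.
    theirs = service_block(DBPEDIA_ENDPOINT, [
        "?species dbp:binomial ?binomial .",
        '?species dbo:conservationStatus "EN" .',
    ])
    return PREFIXES + "SELECT ?name ?species WHERE {\n" + ours + "\n" + theirs + "\n}"


print(build_macropod_query())
```

The generated text could then be submitted to the empty Fuseki instance, which would do the work of calling out to both endpoints and joining the results.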
Come to think of it – how about finding names on dbpedia that use synonyms rather than our accepted names? Maybe we could give some starry-eyed undergrads the job of fixing them on Wikipedia in exchange for course credit. Everyone’s a winner.
If you want to write a query across our SPARQL server and dbpedia, by all means go for it. Provided that we keep the ids for the actual data items persistent, we can fool around with the triples we serve up for them at a later stage.
Second, and rather more interestingly: some of what we are doing is – as far as I can make out – new. Some of this new stuff is what I am hoping to write about.
Looking back, this is a little long for a blog post, and if I go on to discuss what I actually want to discuss, then this post would have two separate and almost completely unrelated topics.
So I will change the title of this post and start a new one about our concretization of relationship instances, including the vexing question of what I am supposed to call doing it. ‘Concretization’ is just plain wrong.
— EDIT —
I think I am going to have to go with “realize”. It seems odd, because it’s the object of the verb “abstract”, but it’s the only thing that fits.