SKOS and classifications

April 29, 2015

“Oh, and we’d really like SKOS as well”.

Riiiight.

We have names and references.
Names appear in references as “instances” of a name.
Some name instances are taxonomic concepts.

We have a classification made up of tree nodes.
A tree node has many subnodes.
A tree node may be used by many supernodes as a subtree.
A tree node (for classification trees) refers to a taxonomic concept instance.

Now here’s the thing.

The basic job of SKOS in our data is to say “this concept is narrower-than that concept”. X ∈ A → X ∈ B. Naively, we’d put this over the node ids and say job done.

But our tree nodes are most certainly not taxonomic concepts. They are placements of concepts in some classification. It’s the concepts that are concepts. So the skos output must declare that it is the taxon concepts that are narrower-than other taxon concepts.
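In Turtle, with entirely made-up URIs for illustration, the difference looks like this – the skos:broader link has to join the concept URIs, not the node ids:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# right: the taxon concepts stand in the broader/narrower relation
ex:concept-A  skos:broader  ex:concept-B .

# wrong: tree nodes are placements of concepts, not concepts
# ex:node-12345  skos:broader  ex:node-67890 .
```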

The problem is that the data becomes a mishmash. If a concept is moved from one place to another, then our classification data (which holds the entire history of a tree) will declare in one node that A⊂B and in another that A⊂C, where B and C are different taxa at the same level.

Not good.

The problem becomes clearer when you consider having two entirely separate and incompatible classifications. It makes sense to say A⊂B according to the people at NSW University, but A⊂C according to the Norwegian National Classification of Herring. These classifications are sources of assertions that you can choose to trust or to not trust.

The nature of our classification nodes becomes clear. Each node is a document – a source of triples – that asserts certain facts and that trusts the assertion of facts in the nodes below it.
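If each node were published as its own named graph, the picture might be sketched like this in TriG – every URI here is invented for illustration:

```trig
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .   # hypothetical URIs throughout

# each tree node is a document – a graph asserting its own facts
ex:node-nsw-42 {
    ex:concept-A  skos:broader  ex:concept-B .
}

ex:node-herring-7 {
    ex:concept-A  skos:broader  ex:concept-C .
}
```

A consumer then picks which graphs – which sources of assertions – it chooses to trust.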

All I need to do is find a suitable ontology for expressing this. The W3C provenance ontology looks promising. It even has – Oh My Lord! – it has these predicates:

prov:wasRevisionOf prov:invalidatedAtTime prov:wasInvalidatedBy

which match exactly certain columns in my data.

(edit: why is this important? It indicates that the provenance ontology and my data may very well be talking about the same kinds of thing. If your data has a slot for “number of wheels” and “tonnage” and so does mine, there’s a good chance we are both discussing vehicles.)

I’m pretty comfortable, at this point, deciding that each tree node in a classification is a prov:Entity object. In fact, it seems to be a prov:Bundle in the sense of its subnodes, and a prov:Collection in the sense of the reified subnode axioms, which are skos terms.
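A tree node typed this way might look roughly like the following – the node and check-in URIs are invented, only the prov terms are real:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical URIs

ex:node-43  a  prov:Entity ;
    prov:wasRevisionOf     ex:node-42 ;
    prov:invalidatedAtTime "2014-06-01T00:00:00Z"^^xsd:dateTime ;
    prov:wasInvalidatedBy  ex:checkin-99 .
```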

I’m probably over-engineering things again. But there you go.


PS: not all subnode relationships are skos broader/narrower term relationships. In particular, hybrids. A hybrid cucumber/pumpkin is not simply a kind of cucumber in the same way that a broadbean is a kind of legume. I cannot simply assert skos inclusion over a tree – the actual link types and node types matter.
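Concretely, a hybrid’s parent links need some predicate other than skos:broader – ex:hybridParent below is a made-up stand-in for whatever that turns out to be:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .   # hypothetical vocabulary

# an ordinary subnode link can be surfaced as skos:broader …
ex:concept-broadbean  skos:broader  ex:concept-legume .

# … but a hybrid's two parent links cannot; some other predicate is needed
ex:concept-hybrid  ex:hybridParent  ex:concept-cucumber ,
                                    ex:concept-pumpkin .
```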


Twitter

March 5, 2015

Our team’s twitter account is https://twitter.com/AuBiodiversity . Follow for updates in regard to … stuff. Biodiversity stuff.


JENA, D2R, TDB, Joseki – happy together at last

March 3, 2015

Well!

The magic formula turned out to be:

  1. Run joseki
  2. Using the d2rq libraries
  3. And just the two joseki jars containing the actual server
  4. Use a new version of TDB, which uses the new ARQ

And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.
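For the record, the assembler side of this looks roughly like the following Turtle – reconstructed from memory of the D2RQ assembler docs, so treat every term here as unverified:

```turtle
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .

# hypothetical sketch: a Jena dataset whose default graph is
# assembled by the D2RQ assembler from a mapping file
<#dataset> a ja:RDFDataset ;
    ja:defaultGraph <#d2rqModel> .

<#d2rqModel> a d2rq:D2RQModel ;
    d2rq:mappingFile <file:mapping.ttl> .
```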

Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on confluence.


JAR file hell with Jena and D2RQ

February 24, 2015

I would very much like for D2RQ to work as a subgraph inside an existing joseki installation. But sweet baby Jesus I just can’t make the nine million JAR file libraries talk to each other.

Tried to put the D2RQ assembler into joseki. Won’t run without an ‘incubation’ library. Blows up with a class file version exception, which means that the jars were compiled against different versions. Which is nuts, because the version of joseki I am using – 3.3.4 – is the same as the one inside d2rq.

Tried to put the D2RQ assembler into fuseki. Fuseki attempts to call “init()” on the assembler, which ain’t there. According to the d2rq project, there is no such method on the interface, so clearly d2r was compiled to a different specification of jena than was fuseki.

Tried to launch the joseki that is inside the d2r installation (which obviously works) as a joseki instance rather than a d2r instance. Nope. joseki.rdfserver isn’t there.

Tried to get the d2rq and joseki source so as to compile them together on the same machine. But the build file specifies java versions, the git project HEAD points to a dev branch, and the joseki project isn’t even git.

I am at the stage of hacking up the d2r server code itself – it has the joseki and the d2rq classes living with one another, I have the source, and it all compiles and builds ok. The issue is that when it launches, the “go and do it” class creates the d2r graph as a top-level object and as a default graph (from a sparql point of view). This won’t do – I need a top-level graph that is a stuck-together frankenstein with the d2r component as a mere subsection of what is going on.

The “go and do it” method returns D2RQModel rather than the interface Model. Happily, I can fix at least that and it still compiles. So maybe I can build the graph that I want internally. But this means learning the programmatic interface to jena – I already have assemblers that are correct (it’s just that they won’t run without colliding into class file version issues). Perhaps just find the source of joseki.rdfserver and copy/paste it into the d2r project? Maybe that’s got a magic “read an assembler and spark up a SPARQL service” method?

If anyone out there has managed to get the d2r assembler working inside fuseki or joseki, or for that matter any implementation of a sparql endpoint, I would be terribly grateful for some tips.


RDF vocabulary – still a problem

February 11, 2015

I probably shouldn’t tell you this, but I have a test instance of d2rq running against a test copy of the new database at http://biodiversity.org.au/d2rq.

I’m finding that whenever d2rq terminates a request due to a timeout, it seems also to be closing the database connection, or something. I’m not sure it manages transactions properly when it’s in a web container. Perhaps I need to give it different connection parameters – tell it to use JNDI, perhaps. A problem is that parts of its config measure time in seconds, other parts measure it in milliseconds. I personally have believed ever since 2002 that any variable holding a physical quantity must have a name suffixed with the unit. It would have saved that Mars mission.

By the time you read this, it has probably already crashed and will need to be bounced tomorrow. But hopefully not.

Most of my work over the past few weeks has been to build a d2rq mapping for our data, and to write a corresponding OWL vocabulary. I have not attempted to use existing vocabularies and fit them to what we are doing, instead opting to write a self-contained vocabulary. For terms that – hopefully – map to common things pertaining generally to taxonomy, see http://biodiversity.org.au/voc/boa/BOA.rdf, and the other files alongside it: Name.rdf, Author.rdf, Reference.rdf, Instance.rdf. Terms that are more specific to our National Species List application are in http://biodiversity.org.au/voc/nsl/NSL.rdf.

Naturally, it still needs cleaning up. Consider http://biodiversity.org.au/voc/boa/Name#Rank-visible. This boolean means “ranks like this need to be shown in the name” – eg, ‘var’. Does this belong in the boa vocabulary? Or is it an artifact of how we produce name strings, belonging in the nsl vocabulary? I don’t know. To a degree, it doesn’t really matter – the main thing is the overall organisation of the data as a model of how the business of taxonomy gets done, and the persistence of the URIs for the individual objects.
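As a sketch of how that boolean might be used – the rank URI below is invented, and only the property URI comes from the vocabulary as published:

```turtle
@prefix boa-name: <http://biodiversity.org.au/voc/boa/Name#> .
@prefix ex:       <http://example.org/> .   # hypothetical instance data

# a rank whose abbreviation must appear in the formatted name string
ex:rank-variety  boa-name:Rank-visible  true .
```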

Before I continue on to what I actually wanted to write about, I see I need to justify not using existing vocabularies.

First: God knows, we tried. Taxon Concept Schema, OAI-PMH, the TDWG ontology, SKOS, LSIDs – it’s all in there in the current stuff. But there’s endless quibbling about whether or not what SKOS means by a ‘concept’ is really the same as a name or an instance of a name (aka: not all instances are taxon concepts), or whether names are really names in the sense intended by the vocabulary you are trying to use, or if some of them are, and if so which ones (An illegitimate name is a name, but is it really a name? Those kinds of discussions.). Multiply this by the total number of terms you are trying to borrow. Each vocabulary presents a list of ranks, of nomenclatural statuses and what have you, but those lists never quite match what we have. 80% are fine.

The underlying cause of this is that taxonomists – you won’t believe this – just make stuff up! That’s why there isn’t a single set of authoritative nomenclatural statuses. Oh, there almost is (he said, laughing bitterly) – but there’s always one or two that don’t quite fit.
The thing is: they’re not doing it for fun, or to be difficult. They are handling the material they have, which varies from place to place. They generate vocabulary because they have to.
Indeed – they have exactly the problem that the IT people have: sometimes, there is no existing term that wouldn’t be wrong. So every site winds up with idiosyncrasies that the computers must handle.

But you are always having to invent extra vocabulary to make things fit properly. We tried to use TCS and wound up putting chunks of information in the ‘custom data’ section (before giving up on TCS altogether because the schema is so tight that it’s very difficult to generate a TCS document that is correct).

The solution we are going with is just to expose the data we have, with a vocabulary that we publish at the site (currently full of “insert description here” descriptions), and to offload the job of determining whether what we mean is what a querent is asking about onto the querent.

Another job I’ve been meaning to do – demonstrate how to set up an empty instance of Fuseki, configure it to talk to our SPARQL endpoint and the dbpedia endpoint (to take the union of the graphs), and write a sparql query that finds macropods in NSW (according to our data) that have a conservation status of endangered according to dbpedia.
Come to think of it – how about finding names on dbpedia that use synonyms rather than our accepted names? Maybe we could give some starry-eyed undergrads the job of fixing them on Wikipedia in exchange for course credit. Everyone’s a winner.
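A federated query of that sort might be sketched like this – the predicates on our side are pure invention for illustration, and dbo:conservationStatus is a DBpedia property as far as I recall, so check everything before relying on it:

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# hypothetical sketch: find our NSW macropods that DBpedia
# marks as endangered ("EN")
SELECT ?name ?status
WHERE {
  ?taxon  a  <http://example.org/Macropod> ;            # placeholder
          <http://example.org/occursIn>      "NSW" ;    # placeholder
          <http://example.org/acceptedName>  ?name .
  SERVICE <http://dbpedia.org/sparql> {
    ?page  rdfs:label             ?name ;
           dbo:conservationStatus ?status .
  }
  FILTER (?status = "EN")
}
```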

If you want to write a query across our SPARQL server and dbpedia, by all means go for it. Provided that we keep the ids for the actual data items persistent, we can fool around with the triples we serve up for them at a later stage.

Second, and rather more interestingly: some of what we are doing is – as far as I can make out – new. Some of this new stuff is what I am hoping to write about.

Looking back, this is a little long for a blog post, and if I go on to discuss what I actually want to discuss, then this post would have two separate and almost completely unrelated topics.

So I will change the title of this post and start a new one about our concretization of relationship instances, including the vexing question of what I am supposed to call doing it. ‘Concretization’ is just plain wrong.

— EDIT —

I think I am going to have to go with “realize”. It seems odd, because it’s the object of the verb “abstract”, but it’s the only thing that fits.


Fish jelly, and linked data.

February 11, 2015
You can’t nail jelly to a tree. But you can put it in a bucket and slap a label on the bucket

Fish. Or, if you prefer, PISCES. http://biodiversity.org.au/name/PISCES, in fact.

PISCES (as xml) (as json)
  URI          http://biodiversity.org.au/afd.name/244465
  LSID         urn:lsid:biodiversity.org.au:afd.name:244465
  modified     2013-10-17
  Title        PISCES
  Complete     PISCES
  Nom. code    Zoological
  Uninomial    PISCES
  Taxonomic events
    PISCES sensu AFD  afd

Our new system has scientific names, with author, date, nomenclatural status – the whole thing. And, of course, these names all have URIs with id numbers. Which is ok from the POV of the semantic web, as URIs are opaque identifiers.

But, but wouldn’t it be nice if you actually could read the URI? Wouldn’t it be nice to just use “Megalops cyprinoides” or “Herring” as an id? I mean – isn’t that the whole point of naming things? The whole point of identifying things is that the name can be used as, well, an identifier. And indeed, it is used that way. All the time. When you go into the nursery and buy a rosebush, the label proudly displays the scientific name as well as what most people actually call it, but it doesn’t cite the author of the name. There’s no point saying “but without the author, it doesn’t mean anything” because it quite plainly and obviously does mean something. To lots of people.

So we support it. Eg http://biodiversity.org.au/name/Herring.

The main people to whom a name without author doesn’t mean anything are those people engaged in the work of managing names. The information-science aspect of keeping a nice clean set of identifiers available for the rest of the world to use. Taxonomists, and people at the pointy end of the job of working out what actually lives on this planet of ours. They’re the only ones who really care.

Thing is, our databases are built around that work – that’s the mission here. Within that space a bare scientific name actually is pretty much meaningless. But given that we do support these URIs, from a semweb perspective what does http://biodiversity.org.au/name/PISCES, or for that matter http://biodiversity.org.au/name/Herring mean, exactly? What is it the URI of? What, in a word, is its referent?

I think that the referent of http://biodiversity.org.au/name/Herring has got to be the word “Herring”. That’s all. Perhaps it’s even owl:sameAs text:Herring – if ‘text’ were a URI scheme.

Our database does not have anything in it whose identifier is simply ‘Herring’. However, we do have a bunch of other stuff that might be of interest. In particular, we have a name whose URI is http://biodiversity.org.au/afd.name/246141. If you ask for Herring, which is an id that is not in our database, you will get that.

Is this a legit thing to do? The URI http://biodiversity.org.au/name/Herring is kind of in a limbo. On the one hand, the URI exists – you don’t get a 404. On the other hand, we don’t serve up any data pertaining to the URI that you have referenced. You ask for ‘Herring’. You get something else that isn’t Herring (one of its fields is the text value ‘Herring’, but what does that mean?). Now, you and I understand this, but a semantic web engine isn’t going to. To put it another way, over the entire graph hosted at our server, http://biodiversity.org.au/name/Herring sits on its own and isn’t connected to anything.

So. To address this, my plan is to create an object whose URI is http://biodiversity.org.au/name/Herring. It will have a type of – ooh, I dunno – boa-name:SimpleName, I’ll give it an rdf:value of ‘Herring’ and a number of predicates named – say – boa-name:isSimpleNameOf (or probably just do it in reverse: boa-name:hasSimpleName on the name object). There you go.
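In Turtle, the plan amounts to something like this – boa-name:SimpleName and boa-name:hasSimpleName are proposed terms, not (yet) published ones:

```turtle
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix boa-name: <http://biodiversity.org.au/voc/boa/Name#> .

# the simple-name object itself
<http://biodiversity.org.au/name/Herring>
    a          boa-name:SimpleName ;
    rdf:value  "Herring" .

# the real name object points at it
<http://biodiversity.org.au/afd.name/246141>
    boa-name:hasSimpleName  <http://biodiversity.org.au/name/Herring> .
```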

The situation is slightly different for http://biodiversity.org.au/taxon/Herring. The meaning of this is rather more specific. This URI is an alternative name – owl:sameAs – for the current APC or AFD concept for the valid or vernacular name whose simple name string is ‘Herring’. It’s the accepted taxon for the name at b.o.a.

That is, there is not a separate object. It’s an alternative URI for an object that we already host. And it may change over time: that’s the nature of semantic web data. Actually implementing this in d2r … I don’t know. I might need to build a few database views for the various ways this fact might be derived.
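So the taxon case reduces to a single triple – the accepted-taxon URI on the right is illustrative, not a real identifier:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# sketch: the taxon URI is just another name for an object we
# already host (the object URI here is invented)
<http://biodiversity.org.au/taxon/Herring>
    owl:sameAs  <http://biodiversity.org.au/afd.taxon/123456> .
```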

At least that’s the plan. If there are outright collisions of simple names in accepted taxa among our different domains, then, well – this will need to be re-thought a bit. It may very well be that http://biodiversity.org.au/taxon/Herring is not usable as an id, but that http://biodiversity.org.au/taxon/animalia/Herring might be.

In any event. The goal is to give these convenience URIs a home and maybe even make them mean something.


Whenever a name is used, it is used somewhere.

February 7, 2015

Whenever a name is used, it is used somewhere. Every instance of a name appearing in print is either de novo, or it’s a citation of that name appearing somewhere else. That’s logic.

And so, our “name” table should be treated as an optional one-to-one table on name instances. Those name instances that are nomenclatural events (the act of giving a specimen a name) have additional data about the name they establish, and other instances cite them. In most cases where a name is simply used, it should be treated as a citation of the protonymic instance.

Fun (pseudo?) fact: taxonomists don’t name species and genera and whatnot. God no! So naïve! They name individual specimens. Then they argue about how we should group the specimens into species. Once that’s done, the specimen with the oldest name names the group. This means that scientific names stay more-or-less stable over time, even as new things are discovered and our understanding of how to group them changes.
But what if something doesn’t fit any of the groups? Well, either this “something” of yours has been properly collected and described, or else it hasn’t. If you have properly collected it, then you get to name it; if you haven’t, then it doesn’t exist until you or someone else does. If we didn’t have this rule (“you must actually have a specimen of whatever-it-is”), then we’d have scientific names for unicorns and fairies. ‘Specimen’ is pretty broadly construed, sometimes, but the rule is still there.

In other words, the NAME_ID on each instance that is not a protonym is – in principle – a derived field. It is (just making up some notation here) “instance is citation of instance”* -> “instance creates name”, over the real world of publications and the names in them.

In a perfect world. We don’t have the whole of the real world in our database. Very few databases do.
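In SPARQL property-path terms, with invented predicate names, the derivation would read something like:

```sparql
# hypothetical predicates: ex:citationOf and ex:createsName
PREFIX ex: <http://example.org/>

SELECT ?instance ?name
WHERE {
  # follow citations back any number of steps, then the creating link
  ?instance  ex:citationOf* / ex:createsName  ?name .
}
```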

Here are a couple of issues:

Common names. Common names sort of exist in the aether; there is no nomenclatural event that creates them. There are also many names which, even if they are real scientific names created validly under whatever code governs them, we don’t necessarily have the creating instance for in our data (eg, stuff that doesn’t occur in Australia, obscure papers, other reasons).

Invalidly published names that are subsequently validly published. Someone names a specimen, but they didn’t dot the i’s and cross the t’s (often, it’s that they didn’t describe the specimen properly. In Latin, dammit, like what God talks.). Someone else subsequently – sometimes even the same author in the same year – does it right. Now, from one point of view the second work is citing the first. But from the point of view of scientific naming, that first name doesn’t “count” as really being a name. So things that want to cite the protonym ought to be citing the second occurrence, not the first. I think what happens here is that one is the protologue and the other is the protonym.

When the word ‘name’ is used in this sense, it is always spoken with a specific emphasis difficult to reproduce in a blog post. This distinction results in my boss occasionally saying things like “But those names aren’t names at all!”, and all the scientists in the room nodding solemnly in agreement. This can really put off logical types who haven’t worked here for a while, to whom the distinction is a new one.

Chains of citations. If we in the database modelled it the way it “really” is, you would have to walk the chain of citations to get at the actual name.

So what’s the upshot of this?

The upshot is that of course we have a name table, and of course every instance holds a pointer to the name that it is an instance of. What I’m saying is that this should be viewed as a denormalised data structure that – for convenience – doesn’t exactly model what’s really going on, and that’s perfectly ok.

