SKOS and classifications

April 29, 2015

“Oh, and we’d really like SKOS as well”.


We have names and references.
Names appear in references as “instances” of a name.
Some name instances are taxonomic concepts.

We have a classification made up of tree nodes.
A tree node has many subnodes.
A tree node may be used by many supernodes as a subtree.
A tree node (for classification trees) refers to a taxonomic concept instance.

Now here’s the thing.

The basic job of SKOS in our data is to say “this concept is narrower-than that concept”. X ∈ A → X ∈ B. Naively, we’d put this over the node ids and say job done.

But our tree nodes are most certainly not taxonomic concepts. They are placements of concepts in some classification. It’s the concepts that are concepts. So the skos output must declare that it is the taxon concepts that are narrower-than other taxon concepts.

The problem is that the data then becomes a mishmash. If a concept is moved from one place to another, then our classification data (which holds the entire history of a tree) will declare in one node that A⊂B and in another that A⊂C, where B and C are different taxa at the same level.

Not good.

The problem becomes clearer when you consider having two entirely separate and incompatible classifications. It makes sense to say A⊂B according to the people at NSW University, but A⊂C according to the Norwegian National Classification of Herring. These classifications are sources of assertions that you can choose to trust or to not trust.

The nature of our classification nodes becomes clear. Each node is a document – a source of triples – that asserts certain facts and that trusts the assertion of facts in the nodes below it.
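A minimal sketch of that idea in Python, using plain tuples rather than a real triple store. The node ids, the taxon ids, and the `trusted_assertions` helper are all made up for illustration; the only point is the shape: each node carries its own assertions and names the nodes whose assertions it trusts.

```python
# Each tree node is treated as a small document: it asserts some
# skos:broader facts itself and trusts whatever its subnodes assert.
# Node ids and taxon ids here are hypothetical.
NODES = {
    "node:1": {"asserts": [("taxon:A", "skos:broader", "taxon:B")], "trusts": []},
    "node:2": {"asserts": [("taxon:A", "skos:broader", "taxon:C")], "trusts": []},
    "node:old-root": {"asserts": [], "trusts": ["node:1"]},
    "node:new-root": {"asserts": [], "trusts": ["node:2"]},
}

def trusted_assertions(node_id, nodes=NODES):
    """Collect the triples a node asserts plus those of every node it trusts."""
    seen, stack, triples = set(), [node_id], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        triples.update(nodes[current]["asserts"])
        stack.extend(nodes[current]["trusts"])
    return triples

old_view = trusted_assertions("node:old-root")
new_view = trusted_assertions("node:new-root")
```

Asking the old root gives only A⊂B and asking the new root gives only A⊂C, so the two statements never land in the same set unless you choose to trust both roots.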

All I need to do is to find a suitable ontology for expressing this. The W3C provenance ontology looks promising. It even has – Oh My Lord! It has these predicates:

prov:wasRevisionOf prov:invalidatedAtTime prov:wasInvalidatedBy

Which match exactly certain columns in my data.

(edit: why is this important? It indicates that the provenance ontology and my data may very well be talking about the same kinds of thing. If your data has a slot for “number of wheels” and “tonnage” and so does mine, there’s a good chance we are both discussing vehicles.)

I’m pretty comfortable, at this point, deciding that each tree node in a classification is a prov:Entity object. In fact, it seems to be a prov:Bundle in the sense of its subnodes, and a prov:Collection in the sense of the reified subnode axioms, which are skos terms.

I’m probably over-engineering things again. But there you go.

PS: not all subnode relationships are skos broader/narrower term relationships. In particular, hybrids. A hybrid cucumber/pumpkin is not simply a kind of cucumber in the same way that a broad bean is a type of legume. I cannot simply assert skos inclusion over a tree – the actual link types and node types matter.
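A rough sketch of what "the link types matter" could look like when generating the SKOS output. The link type names (`subtaxon`, `hybrid-parent`) and the taxon ids are invented for the example; the real tree has its own vocabulary.

```python
# Hypothetical link types; only these are taken to mean real inclusion.
BROADER_LINK_TYPES = {"subtaxon"}

def skos_triples(links):
    """Yield skos:broader triples only for links that really mean inclusion."""
    for child, link_type, parent in links:
        if link_type in BROADER_LINK_TYPES:
            yield (child, "skos:broader", parent)

links = [
    ("taxon:broadbean", "subtaxon", "taxon:legume"),
    ("taxon:cucumber-x-pumpkin", "hybrid-parent", "taxon:cucumber"),
    ("taxon:cucumber-x-pumpkin", "hybrid-parent", "taxon:pumpkin"),
]
result = list(skos_triples(links))
```

The hybrid links are simply skipped here; they would need some other predicate, which this sketch does not try to choose.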

JENA, D2R, TDB, Joseki – happy together at last

March 3, 2015


The magic formula turned out to be:

  1. Run joseki
  2. Using the d2rq libraries
  3. And just the two joseki jars containing the actual server
  4. Use a new version of TDB, which uses the new ARQ

And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.

Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on confluence.

Fish jelly, and linked data.

February 11, 2015
You can’t nail jelly to a tree. But you can put it in a bucket and slap a label on the bucket

Fish. Or, if you prefer, PISCES., in fact.

PISCES (as xml) (as json)
modified 2013-10-17
Complete PISCES
Nom. code Zoological
Uninomial PISCES
Taxonomic events
PISCES sensu AFD afd

Our new system has scientific names, with author, date, nomenclatural status – the whole thing. And, of course, these names all have URIs with id numbers. Which is ok from the POV of the semantic web, as URIs are opaque identifiers.

But wouldn’t it be nice if you actually could read the URI? Wouldn’t it be nice to just use “Megalops cyprinoides” or “Herring” as an id? I mean – isn’t that the whole point of naming things? The whole point of identifying things is that the name can be used as, well, an identifier. And indeed, it is used that way. All the time. When you go into the nursery and buy a rosebush, the label proudly displays the scientific name as well as what most people actually call it, but it doesn’t cite the author of the name. There’s no point saying “but without the author, it doesn’t mean anything” because it quite plainly and obviously does mean something. To lots of people.

So we support it. Eg

The main people to whom a name without author doesn’t mean anything are those people engaged in the work of managing names. The information-science aspect of keeping a nice clean set of identifiers available for the rest of the world to use. Taxonomists, and people at the pointy end of the job of working out what actually lives on this planet of ours. They’re the only ones who really care.

Thing is, our databases are built around that work – that’s the mission here. Within that space a bare scientific name actually is pretty much meaningless. But given that we do support these URIs, from a semweb perspective what does, or for that matter mean, exactly? What is it the URI of? What, in a word, is its referent?

I think that the referent of has got to be the word “Herring”. That’s all. Perhaps it’s even owl:sameAs text:Herring – if ‘text’ were a URI schema.

Our database does not have anything in it whose identifier is simply ‘Herring’. However, we do have a bunch of other stuff that might be of interest. In particular, we have a name whose URI is If you ask for Herring, which is an id that is not in our database, you will get that.

Is this a legit thing to do? The URI is kind of in a limbo. On the one hand, the URI exists – you don’t get a 404. On the other hand, we don’t serve up any data pertaining to the URI that you have referenced. You ask for ‘Herring’. You get something else that isn’t Herring (one of its fields is the text value ‘Herring’, but what does that mean?). Now, you and I understand this, but a semantic web engine isn’t going to. To put it another way, over the entire graph hosted at our server, sits on its own and isn’t connected to anything.

So. To address this, my plan is to create an object whose URI is It will have a type of – ooh, I dunno – boa-name:SimpleName, I’ll give it an rdf:value of ‘Herring’ and a number of predicates named – say – boa-name:isSimpleNameOf (or probably just do it in reverse: boa-name:hasSimpleName on the name object). There you go.
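Sketched as triples in Python, that plan might look something like this. The `example.org` URIs and the `simple_name_triples` helper are placeholders I have made up; only `boa-name:SimpleName`, `rdf:value` and `boa-name:hasSimpleName` come from the plan above.

```python
# All URIs passed in below are placeholders, not the real service URIs.
def simple_name_triples(simple_name_uri, text, name_uris):
    """Describe a simple-name object and link each full name record to it."""
    triples = [
        (simple_name_uri, "rdf:type", "boa-name:SimpleName"),
        (simple_name_uri, "rdf:value", text),
    ]
    for name_uri in name_uris:
        # The "reverse" direction: the predicate sits on the name object.
        triples.append((name_uri, "boa-name:hasSimpleName", simple_name_uri))
    return triples

triples = simple_name_triples(
    "http://example.org/simple-name/Herring",
    "Herring",
    ["http://example.org/name/12345"],
)
```

With something like this in the graph, the simple-name URI is no longer an island: a query can walk from it to every name record that shares the string.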

The situation is slightly different for The meaning of this is rather more specific. This URI is an alternative name – owl:sameAs – for the current APC or AFD concept for the valid or vernacular name whose simple name string is ‘Herring’. It’s the accepted taxon for the name at b.o.a .

That is, there is not a separate object. It’s an alternative URI for an object that we already host. And it may change over time: that’s the nature of semantic web data. Actually implementing this in d2r … I don’t know. I might need to build a few database views for the various ways this fact might be derived.

At least that’s the plan. If there are outright collisions of simple names in accepted taxa among our different domains, then, well – this will need to be re-thought a bit. It may very well be that is not usable as an id, but that might be.

In any event. The goal is to give these convenience URIs a home and maybe even make them mean something.

Neo4j – Don’t like it. I’ll try to explain why.

February 24, 2014

Having looked through the neo4j manual, I am not convinced that it is a good fit for what we are trying to accomplish.

  • The underlying model does not fit well into RDF.

    • It is not a triple store.
    • It is not based around URIs
    • It does not support SPARQL out of the box – it needs add-ons
  • It does not appear to support separate disk files for partitioning data
  • Cypher (the neo4j query language) is not a standard in the same way that SPARQL is
  • Cypher is still being developed (although there is a mechanism for backward compatibility)

These problems can all be addressed, but they will require add-ons and work-arounds to do so.

The underlying model

Neo4j stores a graph of nodes and arcs. The nodes and arcs can be decorated with what neo4j calls ‘labels’ and ‘properties’.

Labels serve much the same purpose as RDF and OWL classes and predicates. A node may have any number of labels, each arc may have one. One of the main points about them is that one may create indexes (I get the impression that this is actually a legacy feature) on a label:property pair. You can index the ‘name’ property of every node with a ‘Taxon’ label. These indexes can be declared unique, which gives you a degree of data integrity checking (although with nothing like the rigour of an RDBMS).

Properties are simply primitive values – numbers, strings, etc. ‘Data properties’ in OWL-speak.

Problems are:

Property and label names are plain strings

Although labels and property names can be URIs, the Cypher language does not support this beyond allowing you to quote these kinds of identifiers (with back-quotes, of all things). It’s missing the ability to declare namespaces to be used as prefixes so far as I can see.

This means that either we put
all over the shop in the queries, or we bolt something over the top of it to supply the missing prefixes when we convert it to RDF. Or we don’t use dublin core.
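The "bolt something over the top" option could be as small as a prefix table applied when the data is converted to RDF. This is only a sketch: the `PREFIXES` table, the helper names, and the choice of the Dublin Core terms namespace are my assumptions, not anything Neo4j provides.

```python
# Prefix table the translation layer would supply; Cypher itself has no
# equivalent declaration. The namespaces listed are examples.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(name, prefixes=PREFIXES):
    """Turn 'dcterms:title' into a full URI; leave anything else alone."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name

def expand_properties(props):
    """Expand every property key of a node or arc before emitting RDF."""
    return {expand(key): value for key, value in props.items()}

expanded = expand_properties({"dcterms:title": "Herring", "rank": "species"})
```

Queries could then use the short `dcterms:title` form as a back-quoted property name, with the full URI appearing only in the RDF output.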

Neo4j permits properties to be placed on arcs

While this is basically a great idea, it doesn’t translate into RDF. The way to do this in RDF would be to generate an rdf:Statement object for each arc, and to attach the properties to that. This means that we require a translation layer (unless the bolt-ons on the web site do something like that).
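For what it is worth, the translation itself is mechanical. Here is a sketch using the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object); the `ex:` names, the `_:arc1` blank node label and the `reify_arc` helper are invented for the example.

```python
def reify_arc(statement_id, subject, predicate, obj, arc_properties):
    """Return triples for one arc plus an rdf:Statement carrying its properties."""
    triples = [
        (subject, predicate, obj),
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    # Properties that lived on the arc now hang off the statement node.
    for key, value in sorted(arc_properties.items()):
        triples.append((statement_id, key, value))
    return triples

triples = reify_arc(
    "_:arc1", "ex:nodeA", "ex:subnodeOf", "ex:nodeB", {"ex:since": "2014-02-24"}
)
```

One arc with one property becomes six triples, which gives a feel for how much bulk this adds if every arc carries data.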

A problem is that we would want to do this a lot – one of the things we need to do is to attach data to the arcs. Really, it’s a deficiency with the RDF model itself, but if we want to produce RDF at all then the question of ‘how do we present this data we have put on the arcs’ becomes a thing.

Another issue is that properties are only ever data, not arcs in themselves. One of the things we may want to do is to use a controlled vocabulary for certain properties. Enumerated types. The way we normally do this is to declare a set of URIs. We can certainly put these in strings as properties on arcs, but they wouldn’t link to nodes in the same way. In RDF, a URI is simply a URI. In SPARQL you can query for ‘nodes having a persistence type that is a shade of blue’, because ‘persistence type’ and ‘colour shade’ are nodes in their own right. But if we want arcs to have a ‘persistence type’, Neo4j just doesn’t work that way.

No quad store

We could simulate a quad store (to permit the SPARQL named graph construct) by adding a property to each node and arc. But again – there would need to be a layer added to translate this hack. Perhaps the SPARQL service built for Neo4j has provision for this.
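A sketch of what that translation layer would have to do, assuming every arc is stamped with a property called `graph` (a name I have made up) and anything without one falls into a default graph.

```python
# "graph" is a hypothetical property name stamped on every arc.
DEFAULT_GRAPH = "urn:local:graph:default"

def arcs_to_quads(arcs):
    """Translate arcs carrying a 'graph' property into (graph, s, p, o) quads."""
    quads = []
    for arc in arcs:
        graph = arc.get("properties", {}).get("graph", DEFAULT_GRAPH)
        quads.append((graph, arc["start"], arc["type"], arc["end"]))
    return quads

def in_graph(quads, graph):
    """Rough stand-in for the SPARQL pattern GRAPH <g> { ?s ?p ?o }."""
    return [(s, p, o) for g, s, p, o in quads if g == graph]

arcs = [
    {"start": "ex:a", "type": "ex:subnodeOf", "end": "ex:b",
     "properties": {"graph": "urn:local:graph:sample"}},
    {"start": "ex:c", "type": "ex:subnodeOf", "end": "ex:b"},
]
quads = arcs_to_quads(arcs)
sample = in_graph(quads, "urn:local:graph:sample")
```

The catch is that nothing in the store enforces the convention: an arc that is missing the property quietly ends up in the default graph.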

The data store, and staging

Jena permits a ‘graph’ to be made up of bits stored in different directories on disk. For instance, in our service layer at present the AFD, APNI/APC, and CoL datasets are split into different files. As far as I can see, Neo4j simply doesn’t do this. Another thing that we can do in JENA is load the vocabulary files from RDF as static data. Neo4j would require them to be converted.

I’m not sure how we would both have an editor app that updates the tree and also have a SPARQL service running against that same data, although this is a problem in both Neo4j and Jena/Joseki. We could

  • run the data store as a separate process on a separate port and communicate over http
  • build the core tree manipulation operations as a library module in Neo4j or joseki (communicating via RMI, perhaps)
  • run neo4j or joseki inside the tree webapp. Doing this probably means we lose all the clustering and management functionality.

Neo4j does do transactions, but it does them by maintaining state in-memory. I’m not 100% confident about that, but then again: I’m not sure how JENA does them.


Cypher is kinda cute. Nodes have parentheses, arcs have arrows suspiciously like the syntax in Graphviz .dot files, and filtering criteria uniformly have square brackets. It has features which I can’t recall as being in SPARQL, that is: it may be better than SPARQL.

The main thing is as stated: it’s not a standard, and they are still working on it. To be confident your code will continue to work, you need to add a cypher version command at the top of the file.

In Conclusion

As I said: I don’t like it, don’t trust it, but maybe I’m just a stick-in-the-mud. The main issue is the mismatch between this and RDF.

Federating data with JENA – Getting JENA going locally

July 29, 2012

Ok! First step is to get JENA/Joseki up and running. It seems that I am out of date – the current product is “Fuseki”. But Joseki works, and I do not currently need the new features in Fuseki.

Download site is here.

Unpacking joseki (after downloading from the browser)
pmurray@Paul3:~$ mkdir SPARQL_DEMO
pmurray@Paul3:~$ cd SPARQL_DEMO/
pmurray@Paul3:~/SPARQL_DEMO$ unzip ~/Downloads/ 
pmurray@Paul3:~/SPARQL_DEMO$ ls

Ok! I am going to build a config file with most of the gear ripped out, and I will provide a static RDF file with a bit of sample data.

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY sample-ontology "urn:local:sample-ontology:" >
    <!ENTITY colour "urn:local:sample-ontology:colour:" >
    <!ENTITY thing "urn:local:sample-ontology:thing:" >
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
]>

<rdf:RDF xmlns="&sample-ontology;"
    xmlns:owl="&owl;"
    xmlns:xsd="&xsd;"
    xmlns:rdfs="&rdfs;"
    xmlns:rdf="&rdf;">

    <owl:Ontology rdf:about=""/>
    <owl:Class rdf:about="&sample-ontology;Colour"/>
    <owl:Class rdf:about="&sample-ontology;ColouredThing"/>

    <owl:ObjectProperty rdf:about="&sample-ontology;hasColour">
        <rdfs:range rdf:resource="&sample-ontology;Colour"/>
        <rdfs:domain rdf:resource="&sample-ontology;ColouredThing"/>
    </owl:ObjectProperty>

    <Colour rdf:about="&colour;RED"/>
    <Colour rdf:about="&colour;ORANGE"/>
    <Colour rdf:about="&colour;YELLOW"/>
    <Colour rdf:about="&colour;GREEN"/>
    <Colour rdf:about="&colour;BLUE"/>
    <Colour rdf:about="&colour;INDIGO"/>
    <Colour rdf:about="&colour;PURPLE"/>

    <ColouredThing rdf:about="&thing;GREENBALL">
        <hasColour rdf:resource="&colour;GREEN"/>
    </ColouredThing>

    <ColouredThing rdf:about="&thing;REDBALL">
        <hasColour rdf:resource="&colour;RED"/>
    </ColouredThing>
</rdf:RDF>

Ok! And we need a very, very basic config file. It’s a bit sad that this counts as “basic”, but there’s not much of a way around it:

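@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .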

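@prefix module: <http://joseki.org/2003/06/module#> .
@prefix joseki: <http://joseki.org/2005/06/configuration#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .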

@prefix : <urn:local:joseki:config:> .

@prefix graph: <urn:local:graph:> .

# Subject assumed: an anonymous node typed as the server.
[] rdf:type joseki:Server ;
  joseki:serverDebug "true" .

ja:MemoryModel rdfs:subClassOf ja:Model .
ja:UnionModel rdfs:subClassOf ja:Model .

:sample_vocabulary a ja:MemoryModel ;
  ja:content [
    ja:externalContent <file:sample.rdf> 
  ] .

:empty_graph a ja:MemoryModel .

:dataset a ja:RDFDataset ;
  ja:defaultGraph :empty_graph ;
  ja:namedGraph [ 
    ja:graphName graph:sample ; 
    ja:graph :sample_vocabulary  
  ] .

# Subject assumed: an anonymous node typed as the service.
[] rdf:type joseki:Service ;
  rdfs:label "SPARQL-SDB";
  joseki:serviceRef "sparql/";
  joseki:dataset :dataset;
  joseki:processor [
    rdfs:label "SPARQL processor" ;
    rdf:type joseki:Processor ;
    module:implementation [  
      rdf:type joseki:ServiceImpl;
      module:className <java:org.joseki.processors.SPARQL>
    ] ;
    joseki:allowExplicitDataset "false"^^xsd:boolean ;
    joseki:allowWebLoading "false"^^xsd:boolean ;
    joseki:lockingPolicy  joseki:lockingPolicyMRSW
  ] .

Great! Now we need to actually start the server with the config file that we have provided:
export JOSEKIROOT=$DD/Joseki-3.4.4
$JOSEKIROOT/bin/rdfserver --port 8081 $DD/joseki_config.ttl

Do please note that the joseki service needs to be running to make the urls work. I mention it in the spirit of “please check that your computer is plugged in”.

Starting the sparql server
pmurray@Paul3:~/SPARQL_DEMO$ ./

And the server starts perfectly fine. At this point, I should be able to navigate to http://localhost:8081/sparql/ (note the slash at the end).

It works fine – joseki correctly complains that I have not given it a query string. So let’s give it one!

http://localhost:8081/sparql/?output=text&query=select * where { graph ?g { ?s ?p ?o } }
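A browser will quietly percent-encode the spaces and braces in that URL for you. If you are building the URL in code instead, something along these lines does the same job with the Python standard library. The `sparql_url` helper is my own name; the endpoint and the `output` and `query` parameters are the ones used above.

```python
from urllib.parse import urlencode

def sparql_url(endpoint, query, output="text"):
    """Build a GET URL for the endpoint with the query percent-encoded."""
    return endpoint + "?" + urlencode({"output": output, "query": query})

url = sparql_url(
    "http://localhost:8081/sparql/",
    "select * where { graph ?g { ?s ?p ?o } }",
)
```

The result can be pasted into a browser or handed to any HTTP client; the server sees the same query either way.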

Now I want a better web interface than typing SPARQL into a command line, so I will use this from now on:

      <form action="http://localhost:8081/sparql/" method="post" target="SPARQLOUTPUT">
	  <textarea style="background-color: #F0F0F0;" name="query" cols="70" rows="27">
select ?g ?s ?p ?o
where { 
  graph ?g { 
    ?s ?p ?o
  }
}
ORDER BY ?g ?s ?p ?o
	  </textarea><br>
	  <input type="radio" name="output" value="xml"> xml, 
	  <input type="radio" name="output" value="json"> json,
	  <input type="radio" name="output" value="text"> text,
	  <input type="radio" name="output" value="csv"> csv,
	  <input type="radio" name="output" value="tsv" checked> tsv<br>
          Force <tt>text/plain</tt>: <input type="checkbox" name="force-accept" value="text/plain"><br>
	  <input type="submit" value="Get Results" >
      </form>

And that does the job. Tick “Force text/plain” to stop your browser from downloading the output as a file.


Federating data with JENA

July 29, 2012

I am going to attempt here to bring it all together and make some magic happen with SPARQL and RDF. My goal is to run a local and largely blank instance of JENA which fetches data from heterogeneous data sources, and applies reasoning rules over the top.

The goal is to demonstrate that rdf can be useful even without global, worldwide agreement on vocabulary and ontology. The key to making this work is not getting everyone to agree on terms and what they mean by terms, but to get everyone to clearly state what terms they use and what they mean by them. Hopefully, the subject matter itself means that the meanings are pretty much compatible.

Speaking of meanings: before I continue, I’d like to apologise in advance for my inevitable solecisms. I’m a computing person, not a biologist or taxonomist.

Step 1: Getting JENA going locally

Step 2: Linking the local JENA to more than one external SPARQL service

Step 3: Using OWL to translate the foreign data into a common local vocabulary

Step 4: running a query.


February 16, 2012


Rather than use an oracle database as the back-end store for our sparql service, I am using TDB: a back end that comes with JENA. It’s just some files in a local directory.

This means that the boa content has to go on the web server, but that’s ok. It’s about 42 gig.

And my God it’s fast. Even checking a regex against all names is … tolerable. Several seconds.

Must add search fields for the names and authors – converted to uppercase, all diacritics removed. Perhaps even add taxamatch conversion.
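The uppercase, no-diacritics conversion is only a few lines. This sketch uses Python’s standard `unicodedata` module; the `search_key` name is mine, and taxamatch is not attempted here.

```python
import unicodedata

def search_key(text):
    """Uppercase a name or author string and strip its diacritics."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()

key = search_key("Müller")
```

Letters that have no decomposition, such as ø, pass through unchanged (apart from the uppercasing), so they would need an explicit substitution table if they matter for searching.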

Also need to write some demo pages and host them on . My demo pages use AJAX, which creates cross-site scripting issues if they are not hosted there.