JENA, D2R, TDB, Joseki – happy together at last

March 3, 2015


The magic formula turned out to be:

  1. Run joseki
  2. Using the d2rq libraries
  3. And just the two joseki jars containing the actual server
  4. Use a new version of TDB, which uses the new ARQ

And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.

Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on confluence.


JAR file hell with Jena and D2RQ

February 24, 2015

I would very much like for D2RQ to work as a subgraph inside an existing joseki installation. But sweet baby Jesus I just can’t make the nine million JAR file libraries talk to each other.

Tried to put the D2RQ assembler into joseki. Won’t run without an ‘incubation’ library. Blows up with a class file version exception, which means that the jars were compiled against different versions. Which is nuts, because the version of joseki I am using – 3.3.4 – is the same as the one bundled inside d2rq.

Tried to put the D2RQ assembler into fuseki. Fuseki attempts to call “init()” on the assembler, which ain’t there. According to the d2rq project, there is no such method on the interface, so clearly d2r was compiled to a different specification of jena than was fuseki.

Tried to launch the joseki that is inside the d2r installation (which obviously works) as a joseki instance rather than a d2r instance. Nope. joseki.rdfserver isn’t there.

Tried to get the d2rq and joseki source so as to compile them together on the same machine. But the build file specifies java versions, the git project HEAD points to a dev branch, and the joseki project isn’t even in git.

I am at the stage of hacking up the d2r server code itself – it has the joseki and the d2rq classes living with one another, I have the source, and it all compiles and builds ok. The issue is that when it launches, the “go and do it” class creates the d2r graph as a top-level object and as a default graph (from a sparql point of view). This won’t do – I need a top-level graph that is a stuck-together frankenstein with the d2r component as a mere subsection of what is going on. The “go and do it” returns D2RQModel rather than the interface Model. Happily, I can fix at least that and it still compiles. So maybe I can build the graph that I want internally. But this means learning the programmatic interface to jena – I already have assemblers that are correct (it’s just that they won’t run without colliding into class file version issues). Perhaps just find the source of joseki.rdfserver and copy/paste it into the d2r project? Maybe that’s got a magic “read an assembler and spark up a SPARQL service” method?

If anyone out there has managed to get the d2r assembler working inside fuseki or joseki, or for that matter any implementation of a sparql endpoint, I would be terribly grateful for some tips.

RDF vocabulary – still a problem

February 11, 2015

I probably shouldn’t tell you this, but I have a test instance of d2rq running against a test copy of the new database at

I’m finding that whenever d2rq terminates a request due to a timeout, it seems also to be closing the database connection, or something. I’m not sure it manages transactions properly when it’s in a web container. Perhaps I need to give it different connection parameters – tell it to use JNDI, perhaps. A problem is that parts of its config measure time in seconds, while other parts measure it in milliseconds. I personally have believed ever since 2002 that any variable holding a physical quantity must have a name suffixed with the unit. It would have saved that Mars mission.
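The convention costs nothing and pays for itself the first time two scales meet. A toy illustration (in Python; the parameter names here are mine, not d2rq’s actual config keys):

```python
# The unit lives in the variable name, so mixing scales forces an explicit conversion.
# These names are invented for illustration - they are not d2rq's real parameters.

def seconds_to_millis(timeout_seconds: float) -> int:
    """Convert a duration in seconds to whole milliseconds."""
    return int(timeout_seconds * 1000)

connection_timeout_seconds = 30
query_timeout_millis = seconds_to_millis(connection_timeout_seconds)
print(query_timeout_millis)  # 30000
```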

By the time you read this, it has probably already crashed and will need to be bounced tomorrow. But hopefully not.

Most of my work over the past few weeks has been to build a d2rq mapping for our data, and to write a corresponding OWL vocabulary. I have not attempted to use existing vocabularies and fit them to what we are doing, instead opting to write a self-contained vocabulary. For terms that – hopefully – map to common things pertaining generally to taxonomy, see, and the other files alongside it: Name.rdf, Author.rdf, Reference.rdf, Instance.rdf. Terms that are more specific to our National Species List application are in

Naturally, it still needs cleaning up. Consider this boolean: it means “ranks like this need to be shown in the name” – eg, ‘var’. Does this belong in the boa vocabulary? Or is it an artifact of how we produce name strings, belonging in the nsl vocabulary? I don’t know. To a degree, it doesn’t really matter – the main thing is the overall organisation of the data as a model of how the business of taxonomy gets done, and the persistence of the URIs for the individual objects.

Before I continue on to what I actually wanted to write about, I see I need to justify not using existing vocabularies.

First: God knows, we tried. Taxon Concept Schema, OAI-PMH, the TDWG ontology, SKOS, LSIDs – it’s all in there in the current stuff. But there’s endless quibbling about whether or not what SKOS means by a ‘concept’ is really the same as a name or an instance of a name (aka: not all instances are taxon concepts), or whether names are really names in the sense intended by the vocabulary you are trying to use, or if some of them are, and if so which ones (An illegitimate name is a name, but is it really a name? Those kinds of discussions.). Multiply this by the total number of terms you are trying to borrow. Each vocabulary presents a list of ranks, of nomenclatural statuses and what have you, but those lists never quite match what we have. 80% are fine.

The underlying cause of this is that taxonomists – you won’t believe this – just make stuff up! That’s why there isn’t a single set of authoritative nomenclatural statuses. Oh, there almost is (he said, laughing bitterly) – but there’s always one or two that don’t quite fit.
The thing is: they’re not doing it for fun, or to be difficult. They are handling the material they have, which varies from place to place. They generate vocabulary because they have to.
Indeed – they have exactly the problem that the IT people have: sometimes, there is no existing term that wouldn’t be wrong. So every site winds up with idiosyncrasies that the computers must handle.

But you are always having to invent extra vocabulary to make things fit properly. We tried to use TCS and wound up putting chunks of information in the ‘custom data’ section (before giving up on TCS altogether because the schema is so tight that it’s very difficult to generate a TCS document that is correct).

The solution we are going with is just to expose the data we have, with a vocabulary that we publish at the site (currently full of “insert description here” descriptions), and to offload the job of determining whether what we mean is what a querent is asking about onto the querent.

Another job I’ve been meaning to do – demonstrate how to set up an empty instance of Fuseki, configure it to talk to our SPARQL endpoint and the dbpedia endpoint (to take the union of the graphs), and write a sparql query that finds macropods in NSW (according to our data) that have a conservation status of endangered according to dbpedia.
Come to think of it – how about finding names on dbpedia that use synonyms rather than our accepted names? Maybe we could give some starry-eyed undergrads the job of fixing them on Wikipedia in exchange for course credit. Everyone’s a winner.
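A sketch of how that federated query might look – the nsl: prefix and every predicate on our side are invented placeholders, not the published vocabulary, and the dbpedia property names should be checked against dbpedia itself:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX nsl:  <urn:example:nsl:>   # placeholder - not our real vocabulary URI

SELECT ?name ?status
WHERE {
  # our side: accepted names of taxa occurring in NSW (predicates invented)
  ?taxon nsl:acceptedName ?name ;
         nsl:occursIn     "NSW" .

  # dbpedia's side: match on the name, pull the conservation status
  SERVICE <http://dbpedia.org/sparql> {
    ?page rdfs:label ?name ;
          dbo:conservationStatus ?status .
  }
  FILTER ( ?status = "EN" )
}
```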

If you want to write a query across our SPARQL server and dbpedia, by all means go for it. Provided that we keep the ids for the actual data items persistent, we can fool around with the triples we serve up for them at a later stage.

Second, and rather more interestingly: some of what we are doing is – as far as I can make out – new. Some of this new stuff is what I am hoping to write about.

Looking back, this is a little long for a blog post, and if I go on to discuss what I actually want to discuss, then this post would have two separate and almost completely unrelated topics.

So I will change the title of this post and start a new one about our concretization of relationship instances, including the vexing question of what I am supposed to call doing it. ‘Concretization’ is just plain wrong.

— EDIT —

I think I am going to have to go with “realize”. It seems odd, because it’s the object of the verb “abstract”, but it’s the only thing that fits.


January 7, 2015

(this post is a bit of a note-to-self. Apologies if it lacks enough context to make it understandable.)

So. We would like to publish our data to the semantic web live using d2rq. This is actually pretty exciting. It will replace our eXist XML database. eXist was a nice idea, but there turned out to be all sorts of problems with using it as a platform.

Furthermore, our new data model embodies a different way of looking at names, references, and taxon concepts which we believe to be an advance on the current TDWG picture. Part of the purpose of publishing the data to the semantic web is to expose this new data model.

But, that’s not what this post is about.

I ran into a nasty little problem with mapping the database, which I seem to have solved, and it wasn’t obvious from the d2rq docs.

It’s like this:

  1. We have a table NAME.
  2. It has an optional one-to-one join to SIMPLE_NAME, which has derived, denormalised data.
  3. SIMPLE_NAME has three fields – FAMILY, GENUS, SPECIES – which link back to NAME.
  4. I would like to expose this with hasFamily, hasGenus, and hasSpecies.

How hard could it be?

I found that when I put one of these properties in, everything was sweet. When I put two in – oh my Lord. It started thinking that things were the genus-of their family: stuff like that.

Eventually, I found something that worked.

The clue is that table aliases are global across the entire configuration. So what works is to create a completely new entity mapping for name for each join, using a table alias (simplename_family, simplename_genus, simplename_species). Using those mappings, the underlying gear seems to be able to produce the triples without tripping over itself – it just treats them as separate things.

But, you may ask, isn’t it going to be a drag to link all of these things with owl:sameAs?

Not at all!

It seems that d2rq is perfectly happy to have multiple mappings that resolve to the same uri. At a guess, it generates a humungous union query underneath it all.

The relevant code looks a bit like this:

# ===============================================================================================
# main name table

map:APNI_Name a d2rq:ClassMap;
	d2rq:dataStorage map:APNI_database;
	d2rq:uriPattern "";
	d2rq:class <>;

# this is the only actual data field that I am pulling out at present

map:APNI_Name_fullName a d2rq:PropertyBridge;
	d2rq:belongsToClassMap map:APNI_Name;
	d2rq:property nsl_name:fullName;
	d2rq:propertyDefinitionLabel "name.fullName";
	d2rq:column "name.full_name";

# ===============================================================================================
# simple name - derived links

map:APNI_SimpleName a d2rq:ClassMap;
	d2rq:dataStorage map:APNI_database;
	d2rq:uriPattern "";
	d2rq:class <>;

# ===============================================================================================
# Alias the name table once for each join

map:APNI_simplenameFamily a d2rq:ClassMap;
	d2rq:dataStorage map:APNI_database;
	d2rq:uriPattern "";
 	d2rq:alias "name AS simplename_family";
	d2rq:class <>;

map:APNI_simplenameGenus a d2rq:ClassMap;
	d2rq:dataStorage map:APNI_database;
	d2rq:uriPattern "";
 	d2rq:alias "name AS simplename_genus";
	d2rq:class <>;

map:APNI_simplenameSpecies a d2rq:ClassMap;
	d2rq:dataStorage map:APNI_database;
	d2rq:uriPattern "";
 	d2rq:alias "name AS simplename_species";
	d2rq:class <>;

# ===============================================================================================
# Map the joins on simplename

map:APNI_SimpleName_family a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:APNI_SimpleName;
    d2rq:property nsl_name:hasFamily;
    d2rq:alias "name AS simplename_family";
    d2rq:refersToClassMap map:APNI_simplenameFamily;
    d2rq:join "nsl_simple_name.family_nsl_id =>";
    d2rq:limitInverse 3;

map:APNI_SimpleName_genus a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:APNI_SimpleName;
    d2rq:property nsl_name:hasGenus;
    d2rq:alias "name AS simplename_genus";
    d2rq:refersToClassMap map:APNI_simplenameGenus;
    d2rq:join "nsl_simple_name.genus_nsl_id =>";
    d2rq:limitInverse 3;

map:APNI_SimpleName_species a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:APNI_SimpleName;
    d2rq:property nsl_name:hasSpecies;
    d2rq:alias "name AS simplename_species";
    d2rq:refersToClassMap map:APNI_simplenameSpecies;
    d2rq:join "nsl_simple_name.species_nsl_id =>";
    d2rq:limitInverse 3;

And with that, this SPARQL:

SELECT ?s ?p ?o WHERE {
      { <http://localhost:2020/resource/> ?p ?o }
      UNION
      { ?s ?p <http://localhost:2020/resource/> }
}

Correctly produces the output (apologies for the wordpress clipping and html entities):

nsl_name:fullName   "Orchidaceae Juss."
rdfs:label          "name #54444: Orchidaceae Juss."

Neo4j – Don’t like it. I’ll try to explain why.

February 24, 2014

Having looked through the neo4j manual, I am not convinced that it is a good fit for what we are trying to accomplish.

  • The underlying model does not fit well into RDF.
    • It is not a triple store.
    • It is not based around URIs
    • It does not support SPARQL out of the box – it needs add-ons
  • It does not appear to support separate disk files for partitioning data
  • Cypher (the neo4j query language) is not a standard in the same way that SPARQL is
  • Cypher is still being developed (although there is a mechanism for backward compatibility)

These problems can all be addressed, but they will require add-ons and work-arounds to do so.

The underlying model

Neo4j stores a graph of nodes and arcs. The nodes and arcs can be decorated with what neo4j calls ‘labels’ and ‘properties’.

Labels serve much the same purpose as RDF and OWL classes and predicates. A node may have any number of labels, each arc may have one. One of the main points about them is that one may create indexes (I get the impression that this is actually a legacy feature) on a label:property pair. You can index the ‘name’ property of every node with a ‘Taxon’ label. These indexes can be declared unique, which gives you a degree of data integrity checking (although with nothing like the rigour of an RDBMS).
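For instance (Neo4j 2.x syntax as best I can tell – check it against the current manual):

```cypher
// index the 'name' property of every node carrying the 'Taxon' label
CREATE INDEX ON :Taxon(name);

// the unique variant, which gives the (weak) integrity check
CREATE CONSTRAINT ON (t:Taxon) ASSERT t.name IS UNIQUE;
```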

Properties are simply primitive values – numbers, strings, etc. ‘Data properties’ in OWL-speak.

Problems are:

Property and label names are plain strings

Although labels and property names can be URIs, the Cypher language does not support this beyond allowing you to quote these kinds of identifiers (with back-quotes, of all things). It’s missing the ability to declare namespaces to be used as prefixes so far as I can see.

This means that either we put the full URIs all over the shop in the queries, or we bolt something over the top of it to supply the missing prefixes when we convert it to RDF. Or we don’t use dublin core.

Neo4j permits properties to be placed on arcs

While this is basically a great idea, it doesn’t translate into RDF. The way to do this in RDF would be to generate an rdf:Statement object for each arc, and to attach the properties to that. This means that we require a translation layer (unless the bolt-ons on the web site do something like that).

A problem is that we would want to do this a lot – one of the things we need to do is to attach data to the arcs. Really, it’s a deficiency with the RDF model itself, but if we want to produce RDF at all then the question of ‘how do we present this data we have put on the arcs’ becomes a thing.
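Concretely, the translation layer would have to emit something like this for every decorated arc (the ex: names are made up):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <urn:example:> .

# one rdf:Statement per arc; the arc's properties hang off the statement node
ex:arc42 a rdf:Statement ;
    rdf:subject   ex:Orchidaceae ;
    rdf:predicate ex:hasParent ;
    rdf:object    ex:Asparagales ;
    ex:addedBy    "pmurray" ;      # a property that lived on the neo4j arc
    ex:confidence "high" .
```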

Another issue is that properties are only ever data, not arcs in themselves. One of the things we may want to do is to use a controlled vocabulary for certain properties – enumerated types. The way we normally do this is to declare a set of URIs. We can certainly put these in strings as properties on arcs, but they wouldn’t link to nodes in the same way. In RDF, a URI is simply a URI. In SPARQL you can query for ‘nodes having a persistence type that is a shade of blue’, because ‘persistence type’ and ‘colour shade’ are nodes in their own right. But if we want arcs to have a ‘persistence type’ that is itself a node, Neo4j just doesn’t work that way.

No quad store

We could simulate a quad store (to permit the SPARQL named graph construct) by adding a property to each node and arc. But again – there would need to be a layer added to translate this hack. Perhaps the SPARQL service built for Neo4j has provision for this.

The data store, and staging

Jena permits a ‘graph’ to be made up of bits stored in different directories on disk. For instance, in our service layer at present the AFD, APNI/APC, and CoL datasets are split into different files. As far as I can see, Neo4j simply doesn’t do this. Another thing that we can do in JENA is load the vocabulary files from RDF as static data. Neo4j would require them to be converted.
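In assembler terms, the sort of thing Jena allows looks roughly like this (written from memory – the directory names and node names are invented, so treat it as a sketch rather than a working config):

```turtle
@prefix ja:  <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .

# each dataset lives in its own directory on disk ...
:afd_graph  a tdb:GraphTDB ; tdb:location "data/afd" .
:apni_graph a tdb:GraphTDB ; tdb:location "data/apni" .

# ... a static vocabulary file is loaded straight from RDF ...
:vocab_graph a ja:MemoryModel ;
    ja:content [ ja:externalContent <file:vocabulary.rdf> ] .

# ... and a union model sticks the lot together
:everything a ja:UnionModel ;
    ja:subModel :afd_graph , :apni_graph , :vocab_graph .
```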

I’m not sure how we would both have an editor app that updates the tree and also have a SPARQL service running against that same data, although this is a problem in both Neo4j and Jena/Joseki. We could

  • run the data store as a separate process on a separate port and communicate over http
  • build the core tree manipulation operations as a library module in Neo4j or joseki (communicating via RMI, perhaps)
  • run neo4j or joseki inside the tree webapp. Doing this probably means we lose all the clustering and management functionality.

Neo4j does do transactions, but it does them by maintaining state in-memory. I’m not 100% confident about that, but then again: I’m not sure how JENA does them.


Cypher

Cypher is kinda cute. Nodes get parentheses, arcs get arrows (suspiciously like the syntax in Graphviz .dot files), and filtering criteria uniformly get square brackets. It has features which I can’t recall being in SPARQL – that is: it may be better than SPARQL.
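For flavour – a made-up pattern, not our schema:

```cypher
// nodes in parentheses, arcs as arrows, with criteria in square brackets
MATCH (child:Taxon)-[:HAS_PARENT]->(parent:Taxon)
WHERE child.name = 'Orchidaceae'
RETURN parent.name
```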

The main thing is as stated: it’s not a standard, and they are still working on it. To be confident your code will continue to work, you need to add a cypher version command at the top of the file.

In Conclusion

As I said: I don’t like it, don’t trust it, but maybe I’m just a stick-in-the-mud. The main issue is the mismatch between this and RDF.

Federating data with JENA – Getting JENA going locally

July 29, 2012

Ok! First step is to get JENA/Joseki up and running. It seems that I am out of date – the current product is “Fuseki”. But Joseki works, and I do not currently need the new features in Fuseki.

Download site is here.

Unpacking joseki (after downloading from the browser)
pmurray@Paul3:~$ mkdir SPARQL_DEMO
pmurray@Paul3:~$ cd SPARQL_DEMO/
pmurray@Paul3:~/SPARQL_DEMO$ unzip ~/Downloads/ 
pmurray@Paul3:~/SPARQL_DEMO$ ls

Ok! I am going to build a config file with most of the gear ripped out, and I will provide a static RDF file with a bit of sample data.

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY sample-ontology "urn:local:sample-ontology:" >
    <!ENTITY colour "urn:local:sample-ontology:colour:" >
    <!ENTITY thing "urn:local:sample-ontology:thing:" >
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
]>

<rdf:RDF xmlns="&sample-ontology;"
         xmlns:owl="&owl;" xmlns:rdf="&rdf;" xmlns:rdfs="&rdfs;">

    <owl:Ontology rdf:about=""/>
    <owl:Class rdf:about="&sample-ontology;Colour"/>
    <owl:Class rdf:about="&sample-ontology;ColouredThing"/>

    <owl:ObjectProperty rdf:about="&sample-ontology;hasColour">
        <rdfs:range rdf:resource="&sample-ontology;Colour"/>
        <rdfs:domain rdf:resource="&sample-ontology;ColouredThing"/>
    </owl:ObjectProperty>

    <Colour rdf:about="&colour;RED"/>
    <Colour rdf:about="&colour;ORANGE"/>
    <Colour rdf:about="&colour;YELLOW"/>
    <Colour rdf:about="&colour;GREEN"/>
    <Colour rdf:about="&colour;BLUE"/>
    <Colour rdf:about="&colour;INDIGO"/>
    <Colour rdf:about="&colour;PURPLE"/>

    <ColouredThing rdf:about="&thing;GREENBALL">
        <hasColour rdf:resource="&colour;GREEN"/>
    </ColouredThing>

    <ColouredThing rdf:about="&thing;REDBALL">
        <hasColour rdf:resource="&colour;RED"/>
    </ColouredThing>

</rdf:RDF>
Ok! And we need a very, very basic config file. It’s a bit sad that this counts as “basic”, but there’s no real way around it:

@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

@prefix module: <http://joseki.org/2003/06/module#> .
@prefix joseki: <http://joseki.org/2003/06/configuration#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .

@prefix : <urn:local:joseki:config:> .

@prefix graph: <urn:local:graph:> .

:server    # subject name was clipped; ':server' is a stand-in
  rdf:type joseki:Server;
  joseki:serverDebug "true".

ja:MemoryModel rdfs:subClassOf ja:Model .
ja:UnionModel rdfs:subClassOf ja:Model .

:sample_vocabulary    # subject recovered from the ja:graph reference below
  a ja:MemoryModel ;
  ja:content [
    ja:externalContent <file:sample.rdf> 
  ] .

:empty_graph a ja:MemoryModel .

:dataset a ja:RDFDataset ;
  ja:defaultGraph :empty_graph ;
  ja:namedGraph [ 
    ja:graphName graph:sample ; 
    ja:graph :sample_vocabulary  
  ] .

:sparql_service    # subject name was clipped; ':sparql_service' is a stand-in
  rdf:type joseki:Service ;
  rdfs:label "SPARQL-SDB";
  joseki:serviceRef "sparql/";
  joseki:dataset :dataset;
  joseki:processor [
    rdfs:label "SPARQL processor" ;
    rdf:type joseki:Processor ;
    module:implementation [  
      rdf:type joseki:ServiceImpl;
      module:className <java:org.joseki.processors.SPARQL>
    ] ;
    joseki:allowExplicitDataset "false"^^xsd:boolean ;
    joseki:allowWebLoading "false"^^xsd:boolean ;
    joseki:lockingPolicy  joseki:lockingPolicyMRSW
  ] .

Great! Now we need to actually start the server with the config file that we have provided:
export JOSEKIROOT=$DD/Joseki-3.4.4
$JOSEKIROOT/bin/rdfserver --port 8081 $DD/joseki_config.ttl

Do please note that the joseki service needs to be running to make the urls work. I mention it in the spirit of “please check that your computer is plugged in”.

Starting the sparql server
pmurray@Paul3:~/SPARQL_DEMO$ ./

And the server starts perfectly fine. At this point, I should be able to navigate to http://localhost:8081/sparql/ (note the slash at the end).

It works fine – joseki correctly complains that I have not given it a query string. So let’s give it one!

http://localhost:8081/sparql/?output=text&query=select * where { graph ?g { ?s ?p ?o } }

Now I want a better web interface than typing SPARQL into a command line, so I will use this from now on:

      <form action="http://localhost:8081/sparql/" method="post" target="SPARQLOUTPUT">
	  <textarea style="background-color: #F0F0F0;" name="query" cols="70" rows="27">
select ?g ?s ?p ?o
where { 
  graph ?g { 
    ?s ?p ?o
  }
}
ORDER BY ?g ?s ?p ?o
	  </textarea><br>
	  <input type="radio" name="output" value="xml"> xml, 
	  <input type="radio" name="output" value="json"> json,
	  <input type="radio" name="output" value="text"> text,
	  <input type="radio" name="output" value="csv"> csv,
	  <input type="radio" name="output" value="tsv" checked> tsv<br>
          Force <tt>text/plain</tt>: <input type="checkbox" name="force-accept" value="text/plain"><br>
	  <input type="submit" value="Get Results" >
      </form>

And that does the job. Tick “Force text/plain” to stop your browser from downloading the output as a file.


Federating data with JENA

July 29, 2012

I am going to attempt here to bring it all together and make some magic happen with SPARQL and RDF. My goal is to run a local and largely blank instance of JENA which fetches data from heterogeneous data sources, and applies reasoning rules over the top.

The goal is to demonstrate that rdf can be useful even without global, worldwide agreement on vocabulary and ontology. The key to making this work is not getting everyone to agree on terms and what they mean by terms, but to get everyone to clearly state which terms they use and what they mean by them. Hopefully, the subject matter itself means that the meanings are pretty much compatible.

Speaking of meanings: before I continue, I’d like to apologise in advance for my inevitable solecisms. I’m a computing person, not a biologist or taxonomist.

Step 1: Getting JENA going locally

Step 2: Linking the local JENA to more than one external SPARQL service

Step 3: Using OWL to translate the foreign data into a common local vocabulary

Step 4: running a query.
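As a taste of step 3, the OWL glue should look something like this – every name below is an invented placeholder:

```turtle
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix local: <urn:example:local:> .    # our common local vocabulary
@prefix them:  <urn:example:foreign:> .  # some external source's terms

# state that the foreign terms line up with ours; a reasoner can then
# answer queries phrased purely in the local vocabulary
them:scientificName owl:equivalentProperty local:fullName .
them:Taxon          owl:equivalentClass    local:TaxonConcept .
```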