Dev Environment

June 23, 2015

So. To run our app on my local machine I need

  1. A database. Postgres.
  2. An LDAP server. Apache.
  3. The “mapper” webapp. Grails on 7070.
  4. The “services” webapp. Grails on 8080.
  5. The “editor” webapp. Rails on 3000.

And then it all goes.

Or it did, until I started putting security annotations on the groovy services.

I’m guessing the problem is that the front end – the browser client – needs to pass login cookies to the server app in order to make certain JSON requests, and it won’t do this because the login cookie has a different origin.


  1. Reverse proxy. Squid on (say) 9090.

With a bit of luck, I can tell squid to send editor requests to the editor and service requests to the services. Hopefully, the web browser will be completely fooled by this and merrily send along the authentication tokens.
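As a rough sketch of the routing I have in mind, here it is in Python rather than squid config. The path prefixes /editor and /services, and the names BACKENDS, backend_for and origin, are assumptions made up for illustration; only the port numbers come from the list above.

```python
from urllib.parse import urlsplit

# Hypothetical path prefixes: the real apps may need to be told apart some other way.
BACKENDS = {
    "/editor": "http://localhost:3000",    # the Rails editor
    "/services": "http://localhost:8080",  # the Grails services
}


def backend_for(url):
    """Return the backend URL that a request arriving at the proxy is forwarded to."""
    parts = urlsplit(url)
    for prefix, backend in BACKENDS.items():
        if parts.path == prefix or parts.path.startswith(prefix + "/"):
            target = backend + parts.path
            if parts.query:
                target += "?" + parts.query
            return target
    raise LookupError("no backend configured for " + parts.path)


def origin(url):
    """Scheme, host and port: the triple the browser compares for same-origin checks."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)
```

The point is that the browser only ever talks to port 9090, so origin() gives the same answer for an editor URL and a services URL, while the two backends on their own ports have different origins. Whether that is enough to make the login cookie travel is still my guess.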

I had hoped to avoid needing to know much about Shiro, spring security, and related issues. But, it seems there’s little avoiding it. :(

Man, I’m looking forward to getting my Apple trashcan and not having to wait a minute and a half to bounce everything. Not so much looking forward to putting aaaaall the stuff on it that I am going to need. Dear God I hope it just installs the stuff without undue fuss – gigabytes of crap, starting with Xcode (which you need for the C compiler and other basic utilities). I’m going to have to fast-forward 5 years of software installation history at the site I work at, as well as moving to Homebrew in place of MacPorts. Maybe I should live-blog it. It’s bound to be hilarious, and involve a huge amount of very bad language.

Must remember not to do it on Wednesday – that’s when the clients drop by to discuss progress.

Git squash

June 15, 2015

Just for my own reference:

I do a lot of work on branches, and I like to check them in periodically as I achieve chunks of work. But it’s quite granular – way too granular for the main branch. So I’d like to compress them down into a single revision for whatever feature I am implementing for the main branch.

Here’s how it’s done. Let’s say I’m on branch NSL-1168 (I name my branches for their JIRA id). Everything needs to be clean and checked in.

First, the state of master after the most recent merge of NSL-1168 into it is tagged (to begin using this process, I have to do that manually). Then:

Grab the comments on all my incremental changes and concatenate them:

echo 'NSL-1168 squash merge' `date` > ~/tmp/merge_comment
echo >> ~/tmp/merge_comment
git log --ancestry-path NSL-1168-last-merge..NSL-1168 >> ~/tmp/merge_comment

Merge master into branch. At this stage, work may need to be done to make the merge work. Hopefully not.

git checkout NSL-1168
git merge master

Squash merge the branch into master. This is the magic bit. We merge and commit with the big long comment detailing all the changes.

git checkout master
git merge --squash NSL-1168
git commit -F ~/tmp/merge_comment

Finally, tidy up. The squash merge doesn’t produce an explicit record that the master and my branch were synced, so I merge master into the branch again – a no-op – just to document inside git that the branches are the same at that point.

Then I move the ‘last-merge’ tag. All my incremental changes from this point will go into the master as a bundle next time I do this.

git checkout NSL-1168
git merge master
git tag -f NSL-1168-last-merge

This leaves the branch one commit ahead of the master, which is ok. We do not merge the branch into the master because not doing that is the whole point of the exercise. Ultimately, the master will look like the branch has never been merged into it. But that’s ok – the squash merges will do the job. This means also that if the JIRA branch is deleted from the repository, then all of those incremental changes will be unreferenced and will disappear. Again – this is ok.

SKOS and classifications

April 29, 2015

“Oh, and we’d really like SKOS as well”.


We have names and references.
Names appear in references as “instances” of a name.
Some name instances are taxonomic concepts.

We have a classification made up of tree nodes.
A tree node has many subnodes.
A tree node may be used by many supernodes as a subtree.
A tree node (for classification trees) refers to a taxonomic concept instance.

Now here’s the thing.

The basic job of SKOS in our data is to say “this concept is narrower-than that concept”. X ∈ A → X ∈ B. Naively, we’d put this over the node ids and say job done.

But our tree nodes are most certainly not taxonomic concepts. They are placements of concepts in some classification. It’s the concepts that are concepts. So the skos output must declare that it is the taxon concepts that are narrower-than other taxon concepts.

The problem is that the data becomes a mishmash. If a concept is moved from one place to another, then our classification data (which holds the entire history of a tree) will declare in one node that A⊂B and in another that A⊂C, where B and C are different taxa at the same level.

Not good.

The problem becomes clearer when you consider having two entirely separate and incompatible classifications. It makes sense to say A⊂B according to the people at NSW University, but A⊂C according to the Norwegian National Classification of Herring. These classifications are sources of assertions that you can choose to trust or to not trust.

The nature of our classification nodes becomes clear. Each node is a document – a source of triples – that asserts certain facts and that trusts the assertion of facts in the nodes below it.
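A toy model of that idea, in Python. This is only a sketch: TreeNode, its fields and the assertions method are names invented here, not our schema, and it pretends every subnode link is a plain broader/narrower link, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """A placement of a taxon concept in one classification (hypothetical model)."""
    concept: str                                   # the taxon concept this node refers to
    subnodes: list = field(default_factory=list)   # the nodes placed directly under it

    def assertions(self):
        """Triples this node asserts itself, plus those of the subnodes it trusts."""
        triples = set()
        for sub in self.subnodes:
            # The claim is about the concepts, but it is this node that makes it.
            triples.add((sub.concept, "skos:broader", self.concept))
            triples |= sub.assertions()
        return triples


# Two incompatible classifications that both place concept A.
nsw = TreeNode("B", [TreeNode("A")])
herring = TreeNode("C", [TreeNode("A")])
```

Asked separately, nsw.assertions() only says that A sits under B and herring.assertions() only says that A sits under C. The mishmash appears only if you throw both sets of triples into one bucket and forget which node said what.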

All I need to do is to find a suitable ontology for expressing this. The W3C provenance ontology looks promising. It even has – Oh My Lord! It has these predicates:

prov:wasRevisionOf prov:invalidatedAtTime prov:wasInvalidatedBy

Which match exactly certain columns in my data.

(edit: why is this important? It indicates that the provenance ontology and my data may very well be talking about the same kinds of thing. If your data has a slot for “number of wheels” and “tonnage” and so does mine, there’s a good chance we are both discussing vehicles.)

I’m pretty comfortable, at this point, deciding that each tree node in a classification is a prov:Entity object. In fact, it seems to be a prov:Bundle in the sense of its subnodes, and a prov:Collection in the sense of the reified subnode axioms, which are skos terms.

I’m probably over-engineering things again. But there you go.

PS: not all subnode relationships are skos broader/narrower term relationships. In particular, hybrids. A hybrid cucumber/pumpkin is not simply a kind of cucumber in the same way that a broadbean is a type of legume. I cannot simply assert skos inclusion over a tree – the actual link types and node types matter.


March 5, 2015

Our team’s twitter account is . Follow for updates in regard to … stuff. Biodiversity stuff.

JENA, D2R, TDB, Joseki – happy together at last

March 3, 2015


The magic formula turned out to be:

  1. Run joseki
  2. Using the d2rq libraries
  3. And just the two joseki jars containing the actual server
  4. Use a new version of TDB, which uses the new ARQ

And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.

Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on confluence.

JAR file hell with Jena and D2RQ

February 24, 2015

I would very much like for D2RQ to work as a subgraph inside an existing joseki installation. But sweet baby Jesus I just can’t make the nine million JAR file libraries talk to each other.

Tried to put the D2RQ assembler into joseki. Won’t run without an ‘incubation’ library. Blows up with a class file version exception, which means that the jars were compiled with other versions. Which is nuts, because the version of joseki I am using – 3.3.4 – is the same version that is bundled inside d2rq.

Tried to put the D2RQ assembler into fuseki. Fuseki attempts to call “init()” on the assembler, which ain’t there. According to the d2rq project, there is no such method on the interface, so clearly d2r was compiled to a different specification of jena than was fuseki.

Tried to launch the joseki that is inside the d2r installation (which obviously works) as a joseki instance rather than a d2r instance. Nope. joseki.rdfserver isn’t there.

Tried to get the d2rq and joseki source so as to compile them together on the same machine. But the build file specifies java versions, the git project HEAD points to a dev branch, and the joseki project isn’t even git.

I am at the stage of hacking up the d2r server code itself – it has the joseki and the d2rq classes living with one another, I have the source and it all compiles and builds ok. Issue is that when it launches, the “go and do it” class creates the d2r graph as a top-level object and as a default graph (from a sparql point of view). This won’t do – I need a top-level graph that is a stuck-together frankenstein that has the d2r component as a mere subsection of what is going on. The “go and do it” returns D2RQModel rather than the interface Model. Happily, I can fix at least that and it still compiles. So maybe I can build the graph that I want internally. But this means learning the programmatic interface to jena – I already have assemblers that are correct (it’s just that they won’t run without colliding into class file version issues). Perhaps just find the source of joseki.rdfserver and copy/paste it into the d2r project? Maybe that’s got a magic “read an assembler and spark up a SPARQL service” method?

If anyone out there has managed to get the d2r assembler working inside fuseki or joseki, or for that matter any implementation of a sparql endpoint, I would be terribly grateful for some tips.

RDF vocabulary – still a problem

February 11, 2015

I probably shouldn’t tell you this, but I have a test instance of d2rq running against a test copy of the new database at

I’m finding that whenever d2rq terminates a request due to a timeout, it seems also to be closing the database connection, or something. I’m not sure it manages transactions properly when it’s in a web container. Perhaps I need to give it different connection parameters – tell it to use JNDI, perhaps. A problem is that parts of its config measure time in seconds, other parts measure it in milliseconds. I personally have believed ever since 2002 that any variable holding a physical quantity must have a name suffixed with the unit. It would have saved that Mars mission.
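What I mean by suffixing the unit, as a tiny Python sketch. The setting names and values here are invented for illustration and are not d2rq's real configuration keys.

```python
# Hypothetical settings: the names and numbers are made up, not taken from d2rq.
MS_PER_SECOND = 1000

query_timeout_s = 30
connection_idle_timeout_ms = 60_000


def seconds_to_ms(duration_s):
    """Convert seconds to milliseconds; the parameter name says which unit goes in."""
    return duration_s * MS_PER_SECOND


# The conversion is explicit, and the new name carries the new unit.
query_timeout_ms = seconds_to_ms(query_timeout_s)

# Both sides are visibly in milliseconds, so the comparison can be checked by eye.
query_outlasts_connection = query_timeout_ms > connection_idle_timeout_ms
```

Without the suffixes, comparing a bare 30 against a bare 60000 looks perfectly reasonable and nobody spots which side needed converting.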

By the time you read this, it has probably already crashed and will need to be bounced tomorrow. But hopefully not.

Most of my work over the past few weeks has been to build a d2rq mapping for our data, and to write a corresponding OWL vocabulary. I have not attempted to use existing vocabularies and fit them to what we are doing, instead opting to write a self-contained vocabulary. For terms that – hopefully – map to common things pertaining generally to taxonomy, see, and the other files alongside it: Name.rdf, Author.rdf, Reference.rdf, Instance.rdf. Terms that are more specific to our National Species List application are in

Naturally, it still needs cleaning up. Consider This boolean means “ranks like this need to be shown in the name” – eg, ‘var’. Does this belong in the boa vocabulary? Or is it an artifact of how we produce name strings, belonging in the nsl vocabulary? I don’t know. To a degree, it doesn’t really matter – the main thing is the overall organisation of the data as a model of how the business of taxonomy gets done, and the persistence of the URIs for the individual objects.

Before I continue on to what I actually wanted to write about, I see I need to justify not using existing vocabularies.

First: God knows, we tried. Taxon Concept Schema, OAI-PMH, the TDWG ontology, SKOS, LSIDs – it’s all in there in the current stuff. But there’s endless quibbling about whether or not what SKOS means by a ‘concept’ is really the same as a name or an instance of a name (aka: not all instances are taxon concepts), or whether names are really names in the sense intended by the vocabulary you are trying to use, or if some of them are, and if so which ones (An illegitimate name is a name, but is it really a name? Those kinds of discussions.). Multiply this by the total number of terms you are trying to borrow. Each vocabulary presents a list of ranks, of nomenclatural statuses and what have you, but those lists never quite match what we have. 80% are fine.

The underlying cause of this is that taxonomists – you won’t believe this – just make stuff up! That’s why there isn’t a single set of authoritative nomenclatural statuses. Oh, there almost is (he said, laughing bitterly) – but there’s always one or two that don’t quite fit.
The thing is: they’re not doing it for fun, or to be difficult. They are handling the material they have, which varies from place to place. They generate vocabulary because they have to.
Indeed – they have exactly the problem that the IT people have: sometimes, there is no existing term that wouldn’t be wrong. So every site winds up with idiosyncrasies that the computers must handle.

But you are always having to invent extra vocabulary to make things fit properly. We tried to use TCS and wound up putting chunks of information in the ‘custom data’ section (before giving up on TCS altogether because the schema is so tight that it’s very difficult to generate a TCS document that is correct).

The solution we are going with is just to expose the data we have, with a vocabulary that we publish at the site (currently full of “insert description here” descriptions), and to offload the job of determining whether what we mean is what a querent is asking about onto the querent.

Another job I’ve been meaning to do – demonstrate how to set up an empty instance of Fuseki, configure it to talk to our SPARQL endpoint and the dbpedia endpoint (to take the union of the graphs), and write a sparql query that finds macropods in NSW (according to our data) that have a conservation status of endangered according to dbpedia.
Come to think of it – how about finding names on dbpedia that use synonyms rather than our accepted names? Maybe we could give some starry-eyed undergrads the job of fixing them on Wikipedia in exchange for course credit. Everyone’s a winner.

If you want to write a query across our SPARQL server and dbpedia, by all means go for it. Provided that we keep the ids for the actual data items persistent, we can fool around with the triples we serve up for them at a later stage.

Second, and rather more interestingly: some of what we are doing is – as far as I can make out – new. Some of this new stuff is what I am hoping to write about.

Looking back, this is a little long for a blog post, and if I go on to discuss what I actually want to discuss, then this post would have two separate and almost completely unrelated topics.

So I will change the title of this post and start a new one about our concretization of relationship instances, including the vexing question of what I am supposed to call doing it. ‘Concretization’ is just plain wrong.

— EDIT —

I think I am going to have to go with “realize”. It seems odd, because it’s the object of the verb “abstract”, but it’s the only thing that fits.