Arduino success

November 7, 2015

So, I picked up a project on the Arduino “Gigs and Collaborations” board. What the guy wanted seemed pretty straightforward, a half-an-hour job. I said I’d do it for fifty bucks.

You poor fool.

Over the week, our correspondence became increasingly heated.

To be fair, much of it was my fault. I simply hadn’t read a lot of what he wrote, but (again, to be fair) it was written in somewhat original syntax and was really a lot of work to decode. I was fooled by his address and his European-looking nom de internet. I have no doubt that this gentleman spent his formative years on the subcontinent or in regions further east.

Anyway. On the 5th, I sent a message with an attached sketch (Arduino projects are called ‘sketches’), describing what the attached code now does. His reply to me contained – among other things – this beauty:

CAMERA NOTE: WHETHER THE CAMERA IS POWERED ALL THE TIME, OR JUST FOR 12 THE SECONDS I ASK TO HAPPEN WITH THE CALL TO “ACTION”. WHEN YOU HIT THE REC BUTTON IT RECORDS FOR 12 SECONDS AND STOPS IT CAN ONLY RECORD A 12 SECOND CLIP. IN OTHER WORDS,, YOU HAVE TO PUSH (TRIGGER) THE RECORD BUTTON AGAIN TO MAKE IT RECORD AGAIN. I USE THE 12 SECOND POWER VIDEO (THE “ACTION”) TO SAVE POWER. SO THE CAMERA DOESN’T RUN ALL NIGHT LONG. JUST WHEN IT IS CALLED. I DON’T SEE A NEED FOR “WAIT FOR MOTION SENSOR TO GO LOW?? BECAUSE IT “WILL”, MIGHT TAKE A MINUTE OR 2 OR 3, AND THAT DOESN’T MATTER. BUT IT WILL GO LOW.AFTER IS DOES, SIMPLY…PLAY ACTION!

Which, frankly, I have only read in full just now. Now that I have completed the project, with hindsight I can see what the bloke is saying. (He’s wrong, by the way. Without a wait, the logic would trigger the start-up sequence over and over.)

So what did I do? I’ll tell you what I did, man. I wrote a long screed with some words in CAPS saying “you said this, then; and you are saying this now”. Then I deleted it without sending it. With age comes wisdom. Instead, I sent him this:

Ok, well the sketch I sent you last time should do what you want, then.

Have you tried it out?

Well, the situation is that he can only really try it out on the weekend. Dude has a job.

Today, I got this:

good morning Paul, it appears to work exactly as anticipated. I am very pleased, played with it all day yesterday. money will be in your account by the end of the day.

please keep my contact info. if you need a reference give them my email, I will speak highly of you. as with most of my prototypes, they seldom work perfectly the first time out, or, I discover “OH!” I want it to do this too!

so there may be some tweeking of this one needed down the road. if not this one, I have other projects for you to work on after this one is finished.

great job!

Fucking “Booyah!” and fist-pump.

Dude may be just a tiny bit optimistic about the possibility that I might accept further work from him. But this was definitely a success.

I also learned quite a bit about accepting work over the internet and corresponding with clients.


Git squash

June 15, 2015

Just for my own reference:

I do a lot of work on branches, and I like to check them in periodically as I achieve chunks of work. But it’s quite granular – way too granular for the main branch. So I’d like to compress them down into a single revision for whatever feature I am implementing for the main branch.

Here’s how it’s done. Let’s say I’m on branch NSL-1168 (I name my branches for their JIRA id). Everything needs to be clean and checked in.

First, the state of master after the most recent merge of NSL-1168 into it is tagged. To begin using this process, I have to create that tag manually.
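
A sketch of that one-off manual step – it’s the same dance as the tidy-up step at the end, assuming the branch and master merge cleanly:

git checkout NSL-1168
git merge master
git tag NSL-1168-last-merge

From then on, the cycle goes like this.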

Grab the comments on all my incremental changes and concatenate them:

# build the commit message: a summary line, a blank line, then the log
# of every incremental commit on the branch since the last squash
echo 'NSL-1168 squash merge' `date` > ~/tmp/merge_comment
echo >> ~/tmp/merge_comment
git log --ancestry-path NSL-1168-last-merge..NSL-1168 >> ~/tmp/merge_comment

Merge master into the branch. At this stage, work may need to be done to resolve merge conflicts. Hopefully not.

git checkout NSL-1168
git merge master

Squash merge the branch into master. This is the magic bit. We merge and commit with the big long comment detailing all the changes.

git checkout master
git merge --squash NSL-1168         # stages the squashed changes without committing
git commit -F ~/tmp/merge_comment   # commit them with the big concatenated comment

Finally, tidy up. The squash merge doesn’t produce an explicit record that the master and my branch were synced, so I merge master into the branch again – a no-op – just to document inside git that the branches are the same at that point.

Then I move the ‘last-merge’ tag. All my incremental changes from this point will go into the master as a bundle next time I do this.

git checkout NSL-1168
git merge master                 # records the sync point; a content no-op
git tag -f NSL-1168-last-merge   # move the tag forward to this point

This leaves the branch one commit ahead of the master, which is ok. We do not merge the branch into the master, because not doing that is the whole point of the exercise. Ultimately, the master will look as though the branch has never been merged into it. But that’s ok – the squash merges will do the job. This also means that if the JIRA branch is deleted from the repository, all of those incremental changes will become unreferenced and will disappear. Again – this is ok.


SKOS and classifications

April 29, 2015

“Oh, and we’d really like SKOS as well”.

Riiiight.

We have names and references.
Names appear in references as “instances” of a name.
Some name instances are taxonomic concepts.

We have a classification made up of tree nodes.
A tree node has many subnodes.
A tree node may be used by many supernodes as a subtree.
A tree node (for classification trees) refers to a taxonomic concept instance.

Now here’s the thing.

The basic job of SKOS in our data is to say “this concept is narrower-than that concept”. X ∈ A → X ∈ B. Naively, we’d put this over the node ids and say job done.

But our tree nodes are most certainly not taxonomic concepts. They are placements of concepts in some classification. It’s the concepts that are concepts. So the SKOS output must declare that it is the taxon concepts that are narrower-than other taxon concepts.
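
To make the difference concrete, here’s a sketch with invented URIs (only the skos term is real). The triples have to hang off the concept instances, not the node ids:

cat > narrower-sketch.ttl <<'EOF'
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# not this - tree nodes are placements, not concepts:
# <http://example.org/node/12345> skos:broader <http://example.org/node/67890> .

# this - the taxon concept instances that the nodes refer to:
<http://example.org/instance/A> skos:broader <http://example.org/instance/B> .
EOF

(“A skos:broader B” reads as “A has broader concept B” – that is, A is narrower-than B.)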

The problem is that the data becomes a mishmash. If a concept is moved from one place to another, then our classification data (which holds the entire history of a tree) will declare in one node that A⊂B and in another that A⊂C, where B and C are different taxa at the same level.

Not good.

The problem becomes clearer when you consider having two entirely separate and incompatible classifications. It makes sense to say A⊂B according to the people at NSW University, but A⊂C according to the Norwegian National Classification of Herring. These classifications are sources of assertions that you can choose to trust or to not trust.

The nature of our classification nodes becomes clear. Each node is a document – a source of triples – that asserts certain facts and that trusts the assertion of facts in the nodes below it.

All I need to do is to find a suitable ontology for expressing this. The W3C provenance ontology looks promising. It even has – oh my Lord! – these predicates:

prov:wasRevisionOf prov:invalidatedAtTime prov:wasInvalidatedBy

Which match certain columns in my data exactly.

(edit: why is this important? It indicates that the provenance ontology and my data may very well be talking about the same kinds of thing. If your data has a slot for “number of wheels” and “tonnage” and so does mine, there’s a good chance we are both discussing vehicles.)

I’m pretty comfortable, at this point, deciding that each tree node in a classification is a prov:Entity object. In fact, it seems to be a prov:Bundle in the sense of its subnodes, and a prov:Collection in the sense of the reified subnode axioms, which are SKOS terms.
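
Something like this sketch, then – the URIs are made up, but the prov and skos terms are real:

cat > node-sketch.ttl <<'EOF'
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# a placement node: an entity with a revision history of its own
<http://example.org/node/42>
    a prov:Entity ;
    prov:wasRevisionOf <http://example.org/node/41> ;
    prov:invalidatedAtTime "2015-04-28T00:00:00"^^xsd:dateTime .

# the assertion that the node carries: concept A sits under concept B
<http://example.org/instance/A> skos:broader <http://example.org/instance/B> .
EOF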

I’m probably over-engineering things again. But there you go.


PS: not all subnode relationships are SKOS broader/narrower-term relationships. In particular, hybrids. A hybrid cucumber/pumpkin is not simply a kind of cucumber in the same way that a broad bean is a type of legume. I cannot simply assert SKOS inclusion over a tree – the actual link types and node types matter.


JENA, D2R, TDB, Joseki – happy together at last

March 3, 2015

Well!

The magic formula turned out to be:

  1. Run joseki
  2. Using the d2rq libraries
  3. And just the two joseki jars containing the actual server
  4. Use a new version of TDB, which uses the new ARQ

And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.

Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on Confluence.


JAR file hell with Jena and D2RQ

February 24, 2015

I would very much like for D2RQ to work as a subgraph inside an existing joseki installation. But sweet baby Jesus, I just can’t make the nine million JAR file libraries talk to each other.

Tried to put the D2RQ assembler into joseki. Won’t run without an ‘incubation’ library. Blows up with a class file version exception, which means the jars were compiled against different versions. Which is nuts, because the version of joseki I am using – 3.3.4 – is the same as the one inside d2rq.

Tried to put the D2RQ assembler into fuseki. Fuseki attempts to call “init()” on the assembler, which ain’t there. According to the d2rq project, there is no such method on the interface, so clearly d2r was compiled against a different specification of jena than fuseki was.

Tried to launch the joseki that is inside the d2r installation (which obviously works) as a joseki instance rather than a d2r instance. Nope. joseki.rdfserver isn’t there.

Tried to get the d2rq and joseki source so as to compile them together on the same machine. But the build file specifies java versions, the git project HEAD points to a dev branch, and the joseki project isn’t even in git.

I am at the stage of hacking up the d2r server code itself – it has the joseki and the d2rq classes living with one another, I have the source, and it all compiles and builds ok. The issue is that when it launches, the “go and do it” class creates the d2r graph as a top-level object and as a default graph (from a sparql point of view). This won’t do – I need a top-level graph that is a stuck-together Frankenstein with the d2r component as a mere subsection of what is going on. The “go and do it” method returns D2RQModel rather than the interface Model. Happily, I can fix at least that and it still compiles. So maybe I can build the graph that I want internally. But this means learning the programmatic interface to jena – and I already have assemblers that are correct (it’s just that they won’t run without colliding into class file version issues). Perhaps just find the source of joseki.rdfserver and copy/paste it into the d2r project? Maybe that’s got a magic “read an assembler and spark up a SPARQL service” method?

If anyone out there has managed to get the d2r assembler working inside fuseki or joseki, or for that matter any implementation of a sparql endpoint, I would be terribly grateful for some tips.


RDF vocabulary – still a problem

February 11, 2015

I probably shouldn’t tell you this, but I have a test instance of d2rq running against a test copy of the new database at http://biodiversity.org.au/d2rq.

I’m finding that whenever d2rq terminates a request due to a timeout, it seems also to be closing the database connection, or something. I’m not sure it manages transactions properly when it’s in a web container. Perhaps I need to give it different connection parameters – tell it to use JNDI, say. A problem is that parts of its config measure time in seconds, and other parts measure it in milliseconds. I have personally believed ever since 2002 that any variable holding a physical quantity must have a name suffixed with the unit. It would have saved that Mars mission.

By the time you read this, it has probably already crashed and will need to be bounced tomorrow. But hopefully not.

Most of my work over the past few weeks has been to build a d2rq mapping for our data, and to write a corresponding OWL vocabulary. I have not attempted to use existing vocabularies and fit them to what we are doing, instead opting to write a self-contained vocabulary. For terms that – hopefully – map to common things pertaining generally to taxonomy, see http://biodiversity.org.au/voc/boa/BOA.rdf, and the other files alongside it: Name.rdf, Author.rdf, Reference.rdf, Instance.rdf. Terms that are more specific to our National Species List application are in http://biodiversity.org.au/voc/nsl/NSL.rdf.

Naturally, it still needs cleaning up. Consider http://biodiversity.org.au/voc/boa/Name#Rank-visible. This boolean means “ranks like this need to be shown in the name” – eg, ‘var’. Does this belong in the boa vocabulary? Or is it an artifact of how we produce name strings, belonging in the nsl vocabulary? I don’t know. To a degree, it doesn’t really matter – the main thing is the overall organisation of the data as a model of how the business of taxonomy gets done, and the persistence of the URIs for the individual objects.

Before I continue on to what I actually wanted to write about, I see I need to justify not using existing vocabularies.

First: God knows, we tried. Taxon Concept Schema, OAI-PMH, the TDWG ontology, SKOS, LSIDs – it’s all in there in the current stuff. But there’s endless quibbling about whether or not what SKOS means by a ‘concept’ is really the same as a name or an instance of a name (aka: not all instances are taxon concepts), or whether names are really names in the sense intended by the vocabulary you are trying to use, or if some of them are, and if so which ones (An illegitimate name is a name, but is it really a name? Those kinds of discussions.). Multiply this by the total number of terms you are trying to borrow. Each vocabulary presents a list of ranks, of nomenclatural statuses and what have you, but those lists never quite match what we have. 80% are fine.

The underlying cause of this is that taxonomists – you won’t believe this – just make stuff up! That’s why there isn’t a single set of authoritative nomenclatural statuses. Oh, there almost is (he said, laughing bitterly) – but there’s always one or two that don’t quite fit.
The thing is: they’re not doing it for fun, or to be difficult. They are handling the material they have, which varies from place to place. They generate vocabulary because they have to.
Indeed – they have exactly the problem that the IT people have: sometimes, there is no existing term that wouldn’t be wrong. So every site winds up with idiosyncrasies that the computers must handle.

But you are always having to invent extra vocabulary to make things fit properly. We tried to use TCS and wound up putting chunks of information in the ‘custom data’ section (before giving up on TCS altogether because the schema is so tight that it’s very difficult to generate a TCS document that is correct).

The solution we are going with is just to expose the data we have, with a vocabulary that we publish at the site (currently full of “insert description here” descriptions), and to offload the job of determining whether what we mean is what a querent is asking about onto the querent.

Another job I’ve been meaning to do – demonstrate how to set up an empty instance of Fuseki, configure it to talk to our SPARQL endpoint and the dbpedia endpoint (to take the union of the graphs), and write a sparql query that finds macropods in NSW (according to our data) that have a conservation status of endangered according to dbpedia.
Come to think of it – how about finding names on dbpedia that use synonyms rather than our accepted names? Maybe we could give some starry-eyed undergrads the job of fixing them on Wikipedia in exchange for course credit. Everyone’s a winner.

If you want to write a query across our SPARQL server and dbpedia, by all means go for it. Provided that we keep the ids for the actual data items persistent, we can fool around with the triples we serve up for them at a later stage.
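
Something like this sketch, say. The endpoint path and the boa predicate are placeholders (the real terms live in the vocabulary files above), but SERVICE is standard SPARQL 1.1 federation and dbo:conservationStatus is a real dbpedia term:

curl -G 'http://biodiversity.org.au/d2rq/sparql' \
     --data-urlencode query='
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT ?name ?status
WHERE {
  # our data: a hypothetical predicate holding the simple name string
  ?n <http://biodiversity.org.au/voc/boa/Name#simpleName> ?name .
  # dbpedia, queried remotely via federation
  SERVICE <http://dbpedia.org/sparql> {
    ?taxon rdfs:label ?name ;
           dbo:conservationStatus ?status .
  }
}
LIMIT 10'

(Matching a plain literal against dbpedia’s language-tagged labels will need a STR()/LANG() fiddle in practice, but that’s the shape of it.)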

Second, and rather more interestingly: some of what we are doing is – as far as I can make out – new. Some of this new stuff is what I am hoping to write about.

Looking back, this is a little long for a blog post, and if I go on to discuss what I actually want to discuss, then this post would have two separate and almost completely unrelated topics.

So I will change the title of this post and start a new one about our concretization of relationship instances, including the vexing question of what I am supposed to call doing it. ‘Concretization’ is just plain wrong.

— EDIT —

I think I am going to have to go with “realize”. It seems odd, because it’s the object of the verb “abstract”, but it’s the only thing that fits.


Fish jelly, and linked data.

February 11, 2015

You can’t nail jelly to a tree. But you can put it in a bucket and slap a label on the bucket.

Fish. Or, if you prefer, PISCES. http://biodiversity.org.au/name/PISCES, in fact.

PISCES (as xml) (as json)

  URI: http://biodiversity.org.au/afd.name/244465
  LSID: urn:lsid:biodiversity.org.au:afd.name:244465
  Modified: 2013-10-17
  Title: PISCES
  Complete: PISCES
  Nom. code: Zoological
  Uninomial: PISCES
  Taxonomic events: PISCES sensu AFD (afd)

Our new system has scientific names, with author, date, nomenclatural status – the whole thing. And, of course, these names all have URIs with id numbers. Which is ok from the POV of the semantic web, as URIs are opaque identifiers.

But, but wouldn’t it be nice if you could actually read the URI? Wouldn’t it be nice to just use “Megalops cyprinoides” or “Herring” as an id? I mean – isn’t that the whole point of naming things? The whole point of identifying things is that the name can be used as, well, an identifier. And indeed, it is used that way. All the time. When you go into the nursery and buy a rosebush, the label proudly displays the scientific name as well as what most people actually call it, but it doesn’t cite the author of the name. There’s no point saying “but without the author, it doesn’t mean anything”, because it quite plainly and obviously does mean something. To lots of people.

So we support it. Eg http://biodiversity.org.au/name/Herring.

The main people to whom a name without an author doesn’t mean anything are the people engaged in the work of managing names – the information-science business of keeping a nice clean set of identifiers available for the rest of the world to use. Taxonomists, and people at the pointy end of the job of working out what actually lives on this planet of ours. They’re the only ones who really care.

Thing is, our databases are built around that work – that’s the mission here. Within that space a bare scientific name actually is pretty much meaningless. But given that we do support these URIs, from a semweb perspective what does http://biodiversity.org.au/name/PISCES, or for that matter http://biodiversity.org.au/name/Herring mean, exactly? What is it the URI of? What, in a word, is its referent?

I think that the referent of http://biodiversity.org.au/name/Herring has got to be the word “Herring”. That’s all. Perhaps it’s even owl:sameAs text:Herring – if ‘text’ were a URI scheme.

Our database does not have anything in it whose identifier is simply ‘Herring’. However, we do have a bunch of other stuff that might be of interest. In particular, we have a name whose URI is http://biodiversity.org.au/afd.name/246141. If you ask for Herring, which is an id that is not in our database, you will get that.

Is this a legit thing to do? The URI http://biodiversity.org.au/name/Herring is kind of in limbo. On the one hand, the URI exists – you don’t get a 404. On the other hand, we don’t serve up any data pertaining to the URI that you referenced. You ask for ‘Herring’; you get something else that isn’t Herring (one of its fields is the text value ‘Herring’, but what does that mean?). Now, you and I understand this, but a semantic web engine isn’t going to. To put it another way: over the entire graph hosted at our server, http://biodiversity.org.au/name/Herring sits on its own and isn’t connected to anything.

So. To address this, my plan is to create an object whose URI is http://biodiversity.org.au/name/Herring. It will have a type of – ooh, I dunno – boa-name:SimpleName; I’ll give it an rdf:value of ‘Herring’ and a number of predicates named – say – boa-name:isSimpleNameOf (or probably just do it in reverse: a boa-name:hasSimpleName on the name object). There you go.
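
What that might look like in turtle – the boa-name namespace URI is invented here, but the rest is straight from the plan above:

cat > simple-name-sketch.ttl <<'EOF'
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix boa-name: <http://example.org/voc/boa/SimpleName#> .  # placeholder namespace

# the new object: its referent is just the word "Herring"
<http://biodiversity.org.au/name/Herring>
    a boa-name:SimpleName ;
    rdf:value "Herring" .

# hung off the real name object, in reverse
<http://biodiversity.org.au/afd.name/246141>
    boa-name:hasSimpleName <http://biodiversity.org.au/name/Herring> .
EOF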

The situation is slightly different for http://biodiversity.org.au/taxon/Herring. The meaning of this is rather more specific. This URI is an alternative name – owl:sameAs – for the current APC or AFD concept for the valid or vernacular name whose simple name string is ‘Herring’. It’s the accepted taxon for the name at b.o.a.

That is, there is not a separate object. It’s an alternative URI for an object that we already host. And it may change over time: that’s the nature of semantic web data. Actually implementing this in d2r … I don’t know. I might need to build a few database views for the various ways this fact might be derived.

At least that’s the plan. If there are outright collisions of simple names in accepted taxa among our different domains, then, well – this will need to be re-thought a bit. It may very well be that http://biodiversity.org.au/taxon/Herring is not usable as an id, but that http://biodiversity.org.au/taxon/animalia/Herring might be.

In any event. The goal is to give these convenience URIs a home and maybe even make them mean something.