Our team’s twitter account is https://twitter.com/AuBiodiversity . Follow for updates in regard to … stuff. Biodiversity stuff.
The magic formula turned out to be:
- Run joseki
- Using the d2rq libraries
- And just the two joseki jars containing the actual server
- Use a new version of TDB, which uses the new ARQ
And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.
Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on Confluence.
I would very much like for D2RQ to work as a subgraph inside an existing joseki installation. But sweet baby Jesus I just can’t make the nine million JAR file libraries talk to each other.
Tried to put the D2RQ assembler into joseki. Won’t run without an ‘incubation’ library. Blows up with a class file version exception, which means that the jars were compiled under different Java versions. Which is nuts, because the version of joseki I am using – 3.3.4 – is the same version that ships inside d2rq.
Tried to put the D2RQ assembler into fuseki. Fuseki attempts to call “init()” on the assembler, which ain’t there. According to the d2rq project, there is no such method on the interface, so clearly d2r was compiled against a different specification of jena than fuseki was.
Tried to launch the joseki that is inside the d2r installation (which obviously works) as a joseki instance rather than a d2r instance. Nope. joseki.rdfserver isn’t there.
Tried to get the d2rq and joseki source so as to compile them together on the same machine. But the build file specifies java versions, the git project HEAD points to a dev branch, and the joseki project isn’t even git.
I am at the stage of hacking up the d2r server code itself – it has the joseki and the d2rq classes living with one another, I have the source and it all compiles and builds ok. The issue is that when it launches, the “go and do it” class creates the d2r graph as a top-level object and as a default graph (from a sparql point of view). This won’t do – I need a top-level graph that is a stuck-together frankenstein with the d2r component as a mere subsection of what is going on. The “go and do it” method returns D2RQModel rather than the interface Model. Happily, I can fix at least that and it still compiles. So maybe I can build the graph that I want internally. But this means learning the programmatic interface to jena – I already have assemblers that are correct (it’s just that they won’t run without colliding into class file version issues). Perhaps just find the source of joseki.rdfserver and copy/paste it into the d2r project? Maybe that’s got a magic “read an assembler and spark up a SPARQL service” method?
If anyone out there has managed to get the d2r assembler working inside fuseki or joseki, or for that matter any implementation of a sparql endpoint, I would be terribly grateful for some tips.
I probably shouldn’t tell you this, but I have a test instance of d2rq running against a test copy of the new database at http://biodiversity.org.au/d2rq.
By the time you read this, it has probably already crashed and will need to be bounced tomorrow. But hopefully not.
Most of my work over the past few weeks has been to build a d2rq mapping for our data, and to write a corresponding OWL vocabulary. I have not attempted to use existing vocabularies and fit them to what we are doing, instead opting to write a self-contained vocabulary. For terms that – hopefully – map to common things pertaining generally to taxonomy, see http://biodiversity.org.au/voc/boa/BOA.rdf, and the other files alongside it: Name.rdf, Author.rdf, Reference.rdf, Instance.rdf. Terms that are more specific to our National Species List application are in http://biodiversity.org.au/voc/nsl/NSL.rdf.
Naturally, it still needs cleaning up. Consider http://biodiversity.org.au/voc/boa/Name#Rank-visible. This boolean means “ranks like this need to be shown in the name” – eg, ‘var’. Does this belong in the boa vocabulary? Or is it an artifact of how we produce name strings, belonging in the nsl vocabulary? I don’t know. To a degree, it doesn’t really matter – the main thing is the overall organisation of the data as a model of how the business of taxonomy gets done, and the persistence of the URIs for the individual objects.
Before I continue on to what I actually wanted to write about, I see I need to justify not using existing vocabularies.
First: God knows, we tried. Taxon Concept Schema, OAI-PMH, the TDWG ontology, SKOS, LSIDs – it’s all in there in the current stuff. But there’s endless quibbling about whether or not what SKOS means by a ‘concept’ is really the same as a name or an instance of a name (aka: not all instances are taxon concepts), or whether names are really names in the sense intended by the vocabulary you are trying to use, or if some of them are, and if so which ones (An illegitimate name is a name, but is it really a name? Those kinds of discussions.). Multiply this by the total number of terms you are trying to borrow. Each vocabulary presents a list of ranks, of nomenclatural statuses and what have you, but those lists never quite match what we have. 80% are fine.
But you are always having to invent extra vocabulary to make things fit properly. We tried to use TCS and wound up putting chunks of information in the ‘custom data’ section (before giving up on TCS altogether because the schema is so tight that it’s very difficult to generate a TCS document that is correct).
The solution we are going with is just to expose the data we have, with a vocabulary that we publish at the site (currently full of “insert description here” descriptions), and to offload the job of determining whether what we mean is what a querent is asking about onto the querent.
If you want to write a query across our SPARQL server and dbpedia, by all means go for it. Provided that we keep the ids for the actual data items persistent, we can fool around with the triples we serve up for them at a later stage.
Second, and rather more interestingly: some of what we are doing is – as far as I can make out – new. Some of this new stuff is what I am hoping to write about.
Looking back, this is a little long for a blog post, and if I go on to discuss what I actually want to discuss, then this post would have two separate and almost completely unrelated topics.
So I will change the title of this post and start a new one about our concretization of relationship instances, including the vexing question of what I am supposed to call doing it. ‘Concretization’ is just plain wrong.
— EDIT —
I think I am going to have to go with “realize”. It seems odd, because it’s the object of the verb “abstract”, but it’s the only thing that fits.
Fish. Or, if you prefer, PISCES. http://biodiversity.org.au/name/PISCES, in fact.
Our new system has scientific names, with author, date, nomenclatural status – the whole thing. And, of course, these names all have URIs with id numbers. Which is ok from the POV of the semantic web, as URIs are opaque identifiers.
But wouldn’t it be nice if you actually could read the URI? Wouldn’t it be nice to just use “Megalops cyprinoides” or “Herring” as an id? I mean – isn’t that the whole point of naming things? The whole point of identifying things is that the name can be used as, well, an identifier. And indeed, it is used that way. All the time. When you go into the nursery and buy a rosebush, the label proudly displays the scientific name as well as what most people actually call it, but it doesn’t cite the author of the name. There’s no point saying “but without the author, it doesn’t mean anything”, because it quite plainly and obviously does mean something. To lots of people.
So we support it. Eg http://biodiversity.org.au/name/Herring.
The main people to whom a name without author doesn’t mean anything are those people engaged in the work of managing names. The information-science aspect of keeping a nice clean set of identifiers available for the rest of the world to use. Taxonomists, and people at the pointy end of the job of working out what actually lives on this planet of ours. They’re the only ones who really care.
Thing is, our databases are built around that work – that’s the mission here. Within that space a bare scientific name actually is pretty much meaningless. But given that we do support these URIs, from a semweb perspective what does http://biodiversity.org.au/name/PISCES, or for that matter http://biodiversity.org.au/name/Herring mean, exactly? What is it the URI of? What, in a word, is its referent?
I think that the referent of http://biodiversity.org.au/name/Herring has got to be the word “Herring”. That’s all. Perhaps it’s even owl:sameAs text:Herring – if ‘text’ were a URI scheme.
Our database does not have anything in it whose identifier is simply ‘Herring’. However, we do have a bunch of other stuff that might be of interest. In particular, we have a name whose URI is http://biodiversity.org.au/afd.name/246141. If you ask for Herring, which is an id that is not in our database, you will get that.
Is this a legit thing to do? The URI http://biodiversity.org.au/name/Herring is kind of in a limbo. On the one hand, the URI exists – you don’t get a 404. On the other hand, we don’t serve up any data pertaining to the URI that you have referenced. You ask for ‘Herring’. You get something else that isn’t Herring (one of its fields is the text value ‘Herring’, but what does that mean?). Now, you and I understand this, but a semantic web engine isn’t going to. To put it another way, over the entire graph hosted at our server, http://biodiversity.org.au/name/Herring sits on its own and isn’t connected to anything.
So. To address this, my plan is to create an object whose URI is http://biodiversity.org.au/name/Herring. It will have a type of – ooh, I dunno – boa-name:SimpleName, I’ll give it an rdf:value of ‘Herring’ and a number of predicates named – say – boa-name:isSimpleNameOf (or probably just do it in reverse: boa-name:hasSimpleName on the name object). There you go.
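As a sketch of what those triples might look like, here is a tiny Python function that mints the hypothetical boa-name:SimpleName node described above. The namespace URI and property names (SimpleName, hasSimpleName) are assumptions following the plan, not the published vocabulary.

```python
# Hypothetical sketch: minting a boa-name:SimpleName object for a bare name
# string, linked from the full name record via boa-name:hasSimpleName.
# The BOA namespace and property names are assumptions, not the real vocabulary.

BASE = "http://biodiversity.org.au/name/"
BOA = "http://biodiversity.org.au/voc/boa/Name#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def simple_name_triples(simple, full_name_uri):
    """Return N-Triples linking a SimpleName node to a full name record."""
    s = f"<{BASE}{simple}>"
    return [
        f"{s} <{RDF}type> <{BOA}SimpleName> .",
        f'{s} <{RDF}value> "{simple}" .',
        f"<{full_name_uri}> <{BOA}hasSimpleName> {s} .",
    ]

for t in simple_name_triples("Herring", "http://biodiversity.org.au/afd.name/246141"):
    print(t)
```

With this, the Herring URI is no longer an island: it has a type, a value, and an inbound link from the real name record.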
The situation is slightly different for http://biodiversity.org.au/taxon/Herring. The meaning of this is rather more specific. This URI is an alternative name – owl:sameAs – for the current APC or AFD concept for the valid or vernacular name whose simple name string is ‘Herring’. It’s the accepted taxon for the name at b.o.a.
That is, there is not a separate object. It’s an alternative URI for an object that we already host. And it may change over time: that’s the nature of semantic web data. Actually implementing this in d2r … I don’t know. I might need to build a few database views for the various ways this fact might be derived.
At least that’s the plan. If there are outright collisions of simple names in accepted taxa among our different domains, then, well – this will need to be re-thought a bit. It may very well be that http://biodiversity.org.au/taxon/Herring is not usable as an id, but that http://biodiversity.org.au/taxon/animalia/Herring might be.
In any event. The goal is to give these convenience URIs a home and maybe even make them mean something.
Whenever a name is used, it is used somewhere. Every instance of a name appearing in print is either de novo, or it’s a citation of that name appearing somewhere else. That’s logic.
And so, our “name” table should be treated as an optional one-to-one table on name instances. Those name instances that are nomenclatural events (the act of giving a specimen a name) have additional data about the name they establish, and other instances cite them. In most cases where a name is simply used, it should be treated as a citation of the protonymic instance.
In other words, the NAME_ID on each instance that is not a protonym is – in principle – a derived field. It is (just making up some notation here) “instance is citation of instance”* -> “instance creates name”, over the real world of publications and the names in them.
In a perfect world. We don’t have the whole of the real world in our database. Very few databases do.
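In that idealised model, deriving the name of any instance is just a walk along the citation chain. A minimal sketch, with entirely made-up field names standing in for “instance is citation of instance” and “instance creates name”:

```python
# Sketch of the idealised derivation: every non-protonym instance cites
# another instance; the name of any instance is the name created by the
# protonym at the end of its citation chain. Field names are made up.

def derived_name_id(instance_id, cites, creates_name):
    """Walk 'cites' links back to the protonymic instance, return its name id."""
    seen = set()
    current = instance_id
    while current in cites:          # not a protonym: follow the citation
        if current in seen:
            raise ValueError("citation cycle")
        seen.add(current)
        current = cites[current]
    return creates_name[current]     # protonym: the nomenclatural event

# Toy data: instance 3 cites 2, 2 cites 1; 1 is the protonym creating name 77.
cites = {3: 2, 2: 1}
creates_name = {1: 77}
print(derived_name_id(3, cites, creates_name))  # -> 77
```

The NAME_ID column on each instance is then a cached copy of what this walk would compute.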
Here are a couple of issues:
Common names. Common names sort of exist in the aether; there is no nomenclatural event that creates them. There are also many names which, even though they are real scientific names created validly under whatever code governs them, don’t have their creating instance in our data (eg, stuff that doesn’t occur in Australia, obscure papers, other reasons).
Invalidly published names that are subsequently validly published. Someone names a specimen, but they didn’t dot the i’s and cross the t’s (often, it’s that they didn’t describe the specimen properly. In Latin, dammit, like what God talks.). Someone else subsequently – sometimes even the same author in the same year – does it right. Now, from one point of view the second work is citing the first. But from the point of view of scientific naming, that first name doesn’t “count” as really being a name. So things that want to cite the protonym ought to be citing the second occurrence, not the first. I think the distinction here is that the one is the protologue and the other is the protonym.
Chains of citations. If we in the database modelled it the way it “really” is, you would have to walk the chain of citations to get at the actual name.
So what’s the upshot of this?
The upshot is that of course we have a name table, and of course every instance holds a pointer to the name that it is an instance of. What I’m saying is that this should be viewed as a denormalised data structure that – for convenience – doesn’t exactly model what’s really going on, and that’s perfectly ok.
Have I blogged about this before?
No matter – I’ll re-do it again, redundantly. We are re-visiting this topic because we are looking at applying this pattern to the new NSL database. The strength of the pattern is that it can be applied to an existing data set without too much trouble.
- Database changes
- Changes to existing code
- The merge operation
- The undo operation
- Deleting a duplicate record
- Deleting a record that is not a duplicate
- Semantic Web Implications
One of the issues in AFD some time ago was duplicate records for publications lying about in the database. Having duplicate records means that when an editor searches for a publication they get a list of identical records and don’t know which one to use, and it makes it difficult to produce such basic things as “what AFD names were published in this paper?”
The AFD people wanted a fix for this. They wanted to be able to resolve duplicates, and they also wanted to be able to undo. Preferably at any level, in any sequence.
For my part, I needed a solution which could be back-fitted to the existing data and not break the various things external to the AFD which relied on the data.
So, after a brief complaint that what they were asking was impossible and a walk through the gardens, the de-duplication algorithm was born. It’s a way of re-engineering a database to allow duplicate records to be merged, without disrupting too much else.
First, it’s obvious that in order to make the thing work without disrupting everything else, then the publication_ids on the existing records must get updated. Alternatives involving joining to another table (containing the publication id you are supposed to be using) are not acceptable – too much work to fix all the things affected. Another possibility would be to rename the affected tables, and create a view with the old name that would compute correct values for the fields. This struck me as again way too much work and load.
However, the business of wanting to make the de-duplication undoable at any level would seem to mean that we have to keep some sort of journal for each record using a publication id. But actually, this is not the case. All you need to do is keep track of the publication id prior to any de-duplication activity. The question of what gets put where is all implied in the “I got marked as a duplicate” field in the publication record itself.
This does mean that changes to the publication id not managed by the de-duplication framework are not journaled … but so what if they aren’t? If a user corrects a record then, well, it is what it is.
It also means that redundant data is stored. If publication A is merged into B, then everything that initially used A will now use B. Since we store A, and the fact that we need to use B can be computed from A plus the merge history over publications, the stored B is redundant. My decision about this was: so what, I don’t care.
Rather than keeping the original, manually assigned publication id in the table that it originally came from, a possibility would be to drop it into another table. I chose not to do this because writing a hibernate mapping for another table would be a drag, and because one column of that table would have to be the name of the table and column that owns the id. We would have to store “that publication.parent_publication_id was originally 5678”. This means we are modelling a database in a database, an antipattern, and it means that we can’t put foreign key constraints on that source_id column – 1234 might be a publication id or it might be a reference id in our particular case.
I elected not to be excessively clever with publications containing sub-items (chapters in books, papers in journals). If a book in our data has chapters 1 and 2, and a duplicate record for the book has chapters 2 and 3, then when the duplicates are merged, the book will have two chapter 2 records. But fixing this automatically is difficult – the algorithm will either be cautious to the point of seldom actually doing anything, or it will merge things that oughtn’t be merged. My decision was that if a user is working in that space (de-duplicating records), then they’d be reviewing the results and could see the duplicate chapter anyway.
Finally, we do have business rules needing to be respected. A publication of type ‘paper’ must appear as a child record of a publication of type ‘journal’. But the rule “you may only merge publications with the same type” won’t do, because occasionally the problem is that something that should have been put in as a paper was also put in as a monograph. So the rule was relaxed to “must have the same type, or have no child records”.
So: here’s what you actually do.
Pick an entity that might have duplicates needing to be managed. Let’s say Author.
Whip through the database and add a second field everywhere there’s a foreign key, which will contain “this is the id that was assigned prior to any de-duplication”. In AFD, I prepended the field name with “original”. Now, Name has an author and a base author, and Reference also has an author, so we are talking about adding an original_author_id and an original_base_author_id to Name, and an original_author_id to Reference.
(on reflection, maybe I ought to have stuck the “original” in just in front of the _ID, to keep the fields together)
I will speak of these pairs as each having an ‘original’ value and a ‘current’ value.
Finally, add a ‘duplicate_of_id’ to the Author table. I will speak of author records as being “current” or “duplicate” according to whether or not this field is null.
The iron rule of this pattern is: a current id never points at a duplicate record.
Any operation that assigns an author to an author_id or base_author_id field must set (or clear) both the current and the original value fields. These operations must never assign the ids of duplicate records. In practice this means that the picklists that you display to users must not be populated with duplicate records, but that’s almost the entire point of the exercise anyway, so it’s all good. (The only time the user sees the old, duplicate records is when they are engaged in the task of managing duplicates.)
If that doesn’t seem like much – well, that is the point of doing it this way. You don’t need to hack up too much existing code.
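The assignment rule can be sketched in a few lines. This uses sqlite3 against a toy two-table schema – the table and column names are illustrative stand-ins, not the real AFD schema:

```python
import sqlite3

# Minimal sketch of the assignment rule: any write to an author reference
# sets both the current and the 'original' column, and refuses duplicates.
# Toy schema for illustration, not the real AFD one.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, duplicate_of_id INTEGER);
    CREATE TABLE name (id INTEGER PRIMARY KEY,
                       author_id INTEGER, original_author_id INTEGER);
    INSERT INTO author VALUES (1, NULL), (2, 1);  -- author 2 is a duplicate of 1
    INSERT INTO name (id) VALUES (10);
""")

def assign_author(con, name_id, author_id):
    dup = con.execute("SELECT duplicate_of_id FROM author WHERE id = ?",
                      (author_id,)).fetchone()[0]
    if dup is not None:
        raise ValueError("must never assign the id of a duplicate record")
    con.execute("UPDATE name SET author_id = ?, original_author_id = ? WHERE id = ?",
                (author_id, author_id, name_id))

assign_author(con, 10, 1)     # fine: author 1 is current
# assign_author(con, 10, 2)   # would raise: author 2 is a duplicate
```

In the real application, the “refuses duplicates” branch mostly never fires, because duplicates are already filtered out of the picklists.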
Deletion of Author records will be discussed further on.
The merge operation may be performed over any two current Authors (subject to other business rules, which may require a little thought). This ensures that the duplicate_of_id graph is acyclic.
As to the question of which way to do it, well by definition it shouldn’t matter, so if we are not merging as directed by the user, then make the newer record the duplicate – this means that older identifiers are preserved and makes the set of identifiers more stable over time (I might have stolen that idea from a related field of information management).
To mark an Author record B as being a newer duplicate of some other record A,
– update the current ids on each of those id pairs to A wherever it is currently B
– set Author B’s duplicate_of_id to A.
“But!” (you may ask) “What about the case where C got merged into B previously?” Well, the current id of all relevant records will be B. Not a problem. The original ids of those pairs will be C, but we are not touching that.
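The two-step merge can be sketched directly. Again this is a toy sqlite3 schema with illustrative names, not the production one:

```python
import sqlite3

# Sketch of the merge ("mark B as a newer duplicate of A").
# Toy schema for illustration; real table and column names will differ.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id TEXT PRIMARY KEY, duplicate_of_id TEXT);
    CREATE TABLE name (id INTEGER PRIMARY KEY,
                       author_id TEXT, original_author_id TEXT);
    INSERT INTO author VALUES ('A', NULL), ('B', NULL);
    INSERT INTO name VALUES (1, 'A', 'A'), (2, 'B', 'B');
""")

def merge(con, dup, target):
    # 1. re-point every current id at the surviving record...
    con.execute("UPDATE name SET author_id = ? WHERE author_id = ?",
                (target, dup))
    # 2. ...and record the merge on the duplicate itself.
    # The original_author_id columns are deliberately left untouched.
    con.execute("UPDATE author SET duplicate_of_id = ? WHERE id = ?",
                (target, dup))

merge(con, 'B', 'A')
print(con.execute("SELECT author_id, original_author_id FROM name "
                  "WHERE id = 2").fetchone())   # -> ('A', 'B')
```

Note that the original id keeps its pre-de-duplication value – that untouched column is the whole journal.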
The undo operation requires a tree walk. However, it will probably not be a very deep or large tree walk, unless you have a system that (for instance) merges large datasets together daily, automatically de-duplicates identifiers, and these de-duplications occasionally need to be corrected by hand. Even then – if my macbook running postgres can handle it, then whatever you are using should be able to handle it.
To undo the merge of (say) F into E
– find all authors that were merged into F (including F itself) with a treewalk
– then update the current id of every pair whose original id is in that set
– set the duplicate_of_id of F to null
The results of the tree walk can be put into a transaction scoped temporary table, or you can simply stuff it into a subquery. Thus:
update name set author_id = 'F'
where original_author_id in (
    with recursive treewalk(id) as (
        select 'F' as id
        union all
        select author.id
        from treewalk
        join author on author.duplicate_of_id = treewalk.id
    )
    select id from treewalk
)
Once for each id pair, wherever it may occur in the schema.
“But!”, you may protest, “this won’t work! It will miss stuff!”. I leave it as an exercise for the reader to satisfy themself that it works perfectly fine, and is awesome.
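For the sceptical reader, the same recursive treewalk runs unchanged under sqlite3, so it is easy to check. Here C was merged into F, then F into E, and we undo the F-into-E merge (toy schema again):

```python
import sqlite3

# Sanity check of the recursive-CTE undo. C was merged into F, then F into
# E; we undo the merge of F into E. Toy schema, illustrative names only.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id TEXT PRIMARY KEY, duplicate_of_id TEXT);
    CREATE TABLE name (id INTEGER PRIMARY KEY,
                       author_id TEXT, original_author_id TEXT);
    INSERT INTO author VALUES ('E', NULL), ('F', 'E'), ('C', 'F');
    -- name 1 was originally F, name 2 originally C; both currently point at E
    INSERT INTO name VALUES (1, 'E', 'F'), (2, 'E', 'C'), (3, 'E', 'E');
""")

con.execute("""
    UPDATE name SET author_id = 'F'
    WHERE original_author_id IN (
        WITH RECURSIVE treewalk(id) AS (
            SELECT 'F'
            UNION ALL
            SELECT author.id FROM treewalk
            JOIN author ON author.duplicate_of_id = treewalk.id
        )
        SELECT id FROM treewalk
    )
""")
con.execute("UPDATE author SET duplicate_of_id = NULL WHERE id = 'F'")

print(con.execute("SELECT id, author_id FROM name ORDER BY id").fetchall())
# -> [(1, 'F'), (2, 'F'), (3, 'E')]
```

Names whose original id was anywhere in F’s subtree (F itself and C) come back to F; name 3, which always belonged to E, is untouched.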
To delete an author that is a duplicate record, let’s say record R which is a duplicate of earlier record Q
– for every pair that has an original id of R, change its original id to Q
– for every author that is marked as a duplicate of R, change it to be marked as a duplicate of Q
– delete R
“But!”, you may protest, “What about those fields whose current id is R? Won’t the foreign key whinge when its referent disappears?”
No current id points to R. R is marked as a duplicate, all current ids pointing at it were moved on when it was so marked.
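The delete-a-duplicate steps, sketched against the same toy schema (illustrative names, not the real AFD tables):

```python
import sqlite3

# Sketch of deleting duplicate record R (a duplicate of Q): re-point the
# de-duplication history at Q, then drop R. Toy schema for illustration.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id TEXT PRIMARY KEY, duplicate_of_id TEXT);
    CREATE TABLE name (id INTEGER PRIMARY KEY,
                       author_id TEXT, original_author_id TEXT);
    INSERT INTO author VALUES ('Q', NULL), ('R', 'Q'), ('S', 'R');
    INSERT INTO name VALUES (1, 'Q', 'R');   -- was R before de-duplication
""")

def delete_duplicate(con, r, q):
    # pairs whose original id was R now remember Q instead
    con.execute("UPDATE name SET original_author_id = ? "
                "WHERE original_author_id = ?", (q, r))
    # records that were duplicates of R become duplicates of Q
    con.execute("UPDATE author SET duplicate_of_id = ? "
                "WHERE duplicate_of_id = ?", (q, r))
    con.execute("DELETE FROM author WHERE id = ?", (r,))

delete_duplicate(con, 'R', 'Q')
print(con.execute("SELECT original_author_id FROM name WHERE id = 1").fetchone())
# -> ('Q',)
```

No current id ever pointed at R, so the foreign keys on the current columns stay happy throughout.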
Deleting a record that is not a duplicate (which we will call record L) means deleting not only L itself, but all records marked as duplicates of L. Other possibilities (reducible to other operations) are:
– undo the “mark as duplicate of L” for all records that are duplicates of L, then continue
– mark L as a dulpicate of some other record, then delete it as a duplicate record
After the “are you sure?” dialog, the steps are these.
– set to null all id pairs whose current id is ‘L’
– delete L and all Authors in its duplicate-of tree
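Those two steps, sketched with the same recursive treewalk as the undo (toy schema, illustrative names):

```python
import sqlite3

# Sketch of deleting a current record L outright: null out every pair
# pointing at it, then delete L and its whole duplicate-of tree.
# Toy schema for illustration; M was merged into L, Z is unrelated.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id TEXT PRIMARY KEY, duplicate_of_id TEXT);
    CREATE TABLE name (id INTEGER PRIMARY KEY,
                       author_id TEXT, original_author_id TEXT);
    INSERT INTO author VALUES ('L', NULL), ('M', 'L'), ('Z', NULL);
    INSERT INTO name VALUES (1, 'L', 'M'), (2, 'Z', 'Z');
""")

# set to null all id pairs whose current id is 'L'
con.execute("UPDATE name SET author_id = NULL, original_author_id = NULL "
            "WHERE author_id = 'L'")
# delete L and all authors in its duplicate-of tree
con.execute("""
    DELETE FROM author WHERE id IN (
        WITH RECURSIVE treewalk(id) AS (
            SELECT 'L'
            UNION ALL
            SELECT author.id FROM treewalk
            JOIN author ON author.duplicate_of_id = treewalk.id
        )
        SELECT id FROM treewalk
    )
""")
print(con.execute("SELECT id FROM author").fetchall())  # -> [('Z',)]
```

Only the unrelated record survives; both halves of the id pair are cleared, so nothing is left remembering the deleted tree.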
The interesting thing here is that if you treat duplicate_of_id as itself being one of those id pairs, then the ‘mark as a duplicate’ operation will keep the current duplicate_of_id up to date, meaning that you can find all author records in the duplicate-of tree of a current record straight away.
If this is done, then rather than using duplicate_of_id = null for records that are not duplicate records, set it to self. That way, your joins always join. Of course, queries looking to select only current records need to be adjusted.
Should you use owl:sameAs?
sameAs means “These two ids are ids for the same thing.” There are two ways to go.
You can regard your URI as being the id of the author. If so, your data should not expose the duplicate records at all, except to say that their ids are sameAs. This means that you cannot expose the de-duplication history at all. It’s meaningless to say that F is replaced by E if F and E are both names for the same thing.
Alternatively, you can treat the id as being the id of the author record. In this case, the URIs are not same-as and it’s meaningful to expose those de-duplication events as predicates. However, they may not really be of interest to anyone. I suppose they become interesting when a person is querying about the provenance of facts, rather than what the facts say.
Perhaps a correct approach would be by way of reification. Let’s say author T is a duplicate of author S. It might be sensible to declare:
:book1
    :author :S ;
    :duplicationInfo [
        a rdf:Statement , :propertyAssignmentThatsAResultOfDeduplication ;
        rdf:subject :book1 ;
        rdf:predicate :author ;
        rdf:object :S ;
        :originalValue :T
    ] .
We do have dcterms:isReplacedBy, but I do not know offhand which other standard vocabularies support this pattern.