It has been a while singe I blogged.
The National Species List team is now proceeding with development, with a focus on replacing the existing APNI Oracle Forms app. Data-wise, there are two important components.
The ‘names’ component is the main focus of APNI, and the one that grapples with the whole question of how taxonomic names are used in scientific publications. The basic idea is first, that whenever a name is used, it’s always used somewhere, and the author who uses that name is almost always citing it from somewhere else – this is simply how the whole process of scholarship works. Second, when a document contains more than one name, they are usually in some sort of relationship to each other at the place they appear.
Thus name ∩ reference ≡ instance. An instance is (almost always) a citation of some other instance, and it has other instances that it cites in the same place (obviously, this is done as a one to many cited_by value on the target). Of course, these relationships have many, many, many different types and sorting through what they are is a challenge for the scientists. Thus a paper may talk about a name, state that another name used elsewhere is a misspelling, and that a further name used somewhere else is a synonym.
The thing to note is that component mainly stores facts about published names – Dr So-and-so did indeed publish a paper where he used certain names and said certain things about them.
Most of the work done so far has been on this component and on a rather nice web app to replace APNI.
The other opponent is the tree component – my bit. My main focus over the past week or two has been getting the APC classification into it. This is nowhere near as straightforward as you might suppose.
The APC data is ink a table, the pertinent bits being “this id was an apc concept from this date to that date, and it was placed under this other id”.
I have dealt with the issue where the other id does not have a date range that matches the ranges of its sub-ids. This happens a lot, owing to the history of APC. Botanists, it seems, really don’t care much about higher classification above family. So Genus A got declared as being put under family F, then at a later date family F was added to APC (with no declared supertaxon), then at a later date still, once the higher classification was assembled (phylum Ph, class C, order O), F was put into order O where it belongs.
In the data, it looks roughly like this:
My system deals with this by creating “declared supertaxon” nodes for ids that are declared as being supertaxa over time ranges where those supertaxa are not APC concepts. That way, if five genera were put in a family that wasn’t in APC until later, that information is captured. You can see the APC tree grow over time – I should post some screenshots.
This isn’t my current problem.
My current problem is this:
Many APC records declare ids that are not APC instances, but the names on those instances are the same as some other instance that is in APC. This happens whenever someone went “we are not using Dr A’s publication for genus G anymore, we are using Dr B’s.” Of course, all the species (S*) in genus G still declare that Dr A’s version of G is their supertaxon.
So the final step of my import is “After loading everything, make a change using right now as the date mark. Anything that has a declared supertaxon that isn’t in APC, where there is a supertaxon that is in APC, should be moved there.”
And this blew up, blew up, blew up. Eventually I found the problem. You see, four or five APC records declare that their supertaxon is the same as the taxon. I ignore that, no probs. But there are also some that declare a supertaxon with the same name as themselves. Which hadn’t been a problem until now. But this new final step attempts to move those taxa to be under themselves.
The nice thing is that my tree manipulation layer catches this nonsense and throws an exception. That is – we don’t wind up with screwed-up data in the final tables. The versioning algorithm says “WTF?” and correctly throws an exception. Which, as a certain colleague of mine would say, is quite pleasing.
So, I wrote a fix. It ran against a test. It didn’t run against the full load. Is this because my fix doesn’t work, or because there is some other even more devious and subtle problem in the data? (UPDATE: more devious and subtle, it seems.)
Here’s an example:
Blue arrows = ‘this node was copied to that node’, a version branch.
Red arrows = ‘this node was replaced by that node’, a version merge.
Black/grey lines = subtaxa
black = current, grey = no longer current
rectangle = taxon
hexagon = unnamed classification root
oval = none of the above. There is one of these: a tag I attach to the APC import itself, prior to the final mixup operations. This tag doesn’t have a name, but there is a comment on it.
flag = a named tag
no arrowhead = a versioning link
dot arrowhead = a tracking link
anchor arrowhead = a fixed link
APC concept 39864 is for apni instance 242777 with name 309388. It declares its supertaxon to be apni instance 242775 which also has name 309388. My system responds by moving all subtaxa of 242775 to 242777 except for 242777 itself. This is moved up to the root of the APC classification. This leaves 242775 dangling there by itself – but that’s not a problem big enough to worry about.
This diagram also shows a final tree operation in which no changes were made (purely because this is only a partial data set). You can see that operation in the move from 4965798:4965842 to 4965798:4965856.
In any case. I printed out the 8 problematic records and gave them to one of the scientists here. She was like “What the hell? Oh I see – the Genus isn’t in APC.” Son in the long term, these problems simply will be fixed.