It has been some time

It has been a while singe I blogged.

The National Species List team is now proceeding with development, with a focus on replacing the existing APNI Oracle Forms app. Data-wise, there are two important components.

The ‘names’ component is the main focus of APNI, and the one that grapples with the whole question of how taxonomic names are used in scientific publications. The basic idea is first, that whenever a name is used, it’s always used somewhere, and the author who uses that name is almost always citing it from somewhere else – this is simply how the whole process of scholarship works. Second, when a document contains more than one name, they are usually in some sort of relationship to each other at the place they appear.

Thus name ∩ reference ≡ instance. An instance is (almost always) a citation of some other instance, and it has other instances that it cites in the same place (obviously, this is done as a one to many cited_by value on the target). Of course, these relationships have many, many, many different types and sorting through what they are is a challenge for the scientists. Thus a paper may talk about a name, state that another name used elsewhere is a misspelling, and that a further name used somewhere else is a synonym.

The thing to note is that component mainly stores facts about published names – Dr So-and-so did indeed publish a paper where he used certain names and said certain things about them.

Most of the work done so far has been on this component and on a rather nice web app to replace APNI.

The other opponent is the tree component – my bit. My main focus over the past week or two has been getting the APC classification into it. This is nowhere near as straightforward as you might suppose.

The APC data is ink a table, the pertinent bits being “this id was an apc concept from this date to that date, and it was placed under this other id”.

I have dealt with the issue where the other id does not have a date range that matches the ranges of its sub-ids. This happens a lot, owing to the history of APC. Botanists, it seems, really don’t care much about higher classification above family. So Genus A got declared as being put under family F, then at a later date family F was added to APC (with no declared supertaxon), then at a later date still, once the higher classification was assembled (phylum Ph, class C, order O), F was put into order O where it belongs.

In the data, it looks roughly like this:

Taxon Supertaxon From To
G F 1998 ongoing
F 1999 2002
C Ph 2000 ongoing
O C 2000 ongoing
F O 2003 ongoing

My system deals with this by creating “declared supertaxon” nodes for ids that are declared as being supertaxa over time ranges where those supertaxa are not APC concepts. That way, if five genera were put in a family that wasn’t in APC until later, that information is captured. You can see the APC tree grow over time – I should post some screenshots.

This isn’t my current problem.

My current problem is this:

Many APC records declare ids that are not APC instances, but the names on those instances are the same as some other instance that is in APC. This happens whenever someone went “we are not using Dr A’s publication for genus G anymore, we are using Dr B’s.” Of course, all the species (S*) in genus G still declare that Dr A’s version of G is their supertaxon.

So the final step of my import is “After loading everything, make a change using right now as the date mark. Anything that has a declared supertaxon that isn’t in APC, where there is a supertaxon that is in APC, should be moved there.”

And this blew up, blew up, blew up. Eventually I found the problem. You see, four or five APC records declare that their supertaxon is the same as the taxon. I ignore that, no probs. But there are also some that declare a supertaxon with the same name as themselves. Which hadn’t been a problem until now. But this new final step attempts to move those taxa to be under themselves.

The nice thing is that my tree manipulation layer catches this nonsense and throws an exception. That is – we don’t wind up with screwed-up data in the final tables. The versioning algorithm says “WTF?” and correctly throws an exception. Which, as a certain colleague of mine would say, is quite pleasing.

So, I wrote a fix. It ran against a test. It didn’t run against the full load. Is this because my fix doesn’t work, or because there is some other even more devious and subtle problem in the data? (UPDATE: more devious and subtle, it seems.)

Here’s an example:

Blue arrows = ‘this node was copied to that node’, a version branch.
Red arrows = ‘this node was replaced by that node’, a version merge.
Black/grey lines = subtaxa
black = current, grey = no longer current
rectangle = taxon
hexagon = unnamed classification root
oval = none of the above. There is one of these: a tag I attach to the APC import itself, prior to the final mixup operations. This tag doesn’t have a name, but there is a comment on it.
flag = a named tag
no arrowhead = a versioning link
dot arrowhead = a tracking link
anchor arrowhead = a fixed link

APC concept 39864 is for apni instance 242777 with name 309388. It declares its supertaxon to be apni instance 242775 which also has name 309388. My system responds by moving all subtaxa of 242775 to 242777 except for 242777 itself. This is moved up to the root of the APC classification. This leaves 242775 dangling there by itself – but that’s not a problem big enough to worry about.

This diagram also shows a final tree operation in which no changes were made (purely because this is only a partial data set). You can see that operation in the move from 4965798:4965842 to 4965798:4965856.

In any case. I printed out the 8 problematic records and gave them to one of the scientists here. She was like “What the hell? Oh I see – the Genus isn’t in APC.” Son in the long term, these problems simply will be fixed.


One Response to It has been some time

  1. Paul Murray says:

    The more devious and subtle problem was when a node needing a new name has a subnode that also needed to be shifted to a new name. No – that’s not right: it was when a node that was an instance under which other instances needed to be put also itself needed to be moved to a correct name. My algorithm failed depending on the sequence in which it tried to do the resolution, which meant it depended on what order the queries happened to pull out the problematic records.

    I fixed this by saying “if a node needs to be moved under a correct name, then if this node is the correct name for other nodes and consequently has has to have a new version made for it, then move that new version to the place where it has got to go. Otherwise, move the node (kinda).”

    Sadly, it is *still* bowing up. I have yet another problem – or my implementation of the solution for this problem is faulty. The load takes 20 minutes to run, and is far too big to graph, so to fix these bugs I need to isolate a set of records that together reproduce them. This is not easy.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: