October 27, 2014

Well, there are a dozen or so code snippets on the web to deal with this. I found one that worked. It’s not ideal – it recompiles the template each iteration, rather than reusing the already compiled template and cloning it, but meh.

What I am really trying to do is to show the user a view of this data structure that makes sense to them.

The trick is that each node is the head of a tree, which means an indented list. Each node has a history of updates, which in the general case forms two trees forward and back. But it is usually the case that the updates form a single line of history. You can see that in this image as pairs of red and blue arrows. The blue arrows are branches, the red arrows merges.

Each node also potentially appears in multiple contexts if you go up the tree. In the general case (not in this diagram), these contexts may be in different spaces. A user may create a tree by grafting together parts of other trees.

So, I want a single-pane tree explorer. With this component, I need to be able to do three jobs.

First, show two panes next to one another and highlight the differences and common nodes. I want the user to be able to pull up a node and the previous node, and to see the spots where the tree has changed. Here’s a demo of the kind of layout I am thinking about.

Second, edit a tree. I want to have a tree in draft mode on the left, the explorer on the right, and for the user to be able to drag-and-drop nodes onto their draft tree.

Third, the apply changes tool. Pull up a tree that is a finalised edit and the tree into which these edits are to be put, and show what nodes are going to be replaced. Again, the general case of this can be quite complex.

Complications? We got ‘em. In particular, access control. You can pull in other people’s trees, but that doesn’t mean you can update them. Another one is asynchronous updates. If you pull in part of someone else’s tree, and that person updates their tree, then your tree gets updated (this is a key part of the system). What happens on the client side when the underlying data changes?

But, all the tools are there, all the bits. It’s a matter of assembling them into something that works. We have a guy on the project who is very good at UI: colours, fonts, layout – that sort of thing. So I can make it work and then worry about the CSS later.


October 26, 2014

“So why do you want to use AngularJS?”
“Because my technical lead says to use it.”

Job done. No longer my problem :) .

I had a bash at making it go today. My goal was to make a webpage that would act as an expanding classification list using the JSON data available at . It’s pretty cool, but it has one rather nasty drawback.

The expanding list template, of course, needs to include itself as a sub-template. What’s happening is that Angular is freezing on startup – eventually Firefox says “hey – this script is unresponsive, you want to kill it?”. Thing is: if you do kill it, then press the “open ANIMALIA” button, nothing happens. Press it again, and the taxa start opening up and everything is sweet.

So obviously the angular template compiler is in some sort of “before I compile this, I need to compile the things it includes” loop. I need some way to explain to it that it is not to do this. It seems to be able to defer template compilation, because it eventually does work ok.

Oh – other issue is cross-site scripting, of course. I had to write a local webapp to serve up the pages that also had a little servlet to proxy the JSON from .

But, if I can get that initialization sorted out, it’s pretty sweet.

If this doesn’t work this time, I am going home anyway

October 8, 2014

The thing about having this curated tree thingumajig is that it’s general enough to handle, well, whenever you have a hierarchically organised document that changes over time.

I have managed to get the APC (Australian Plant Census) classification into it (although this is failing in TEST because the psql sequence for ids is capped at 10 million). But the Australian Plant Census is not the only classification around. There is the vexing problem of “what family is this name in?” for – well – everything, really.

You see, names are built from other names. A species name has a genus. But something as basic as family is not really part of the name, as such. We need a classification for where we have put the names as far as our internal systems are concerned: all names – excluded, miscellaneous, what have you. Prior to this, that data was simply part of the name table. As I understand it, as the names people work through what it all means, the problem is that that is simply wrong.

So we need another classification alongside APC in my tree thing.

And this is absolutely not a problem at all. It’s built to do this. It will take the APC, and AFD trees, as well as the APNI classification and all of our other APNI-like classifications (AMANI and so on). Oh, and I have saved a slot for the herbarium classification – ‘CANB’.

But right now, I am attempting to load 188340 names from APNI into my long-suffering postgres instance running on my macbook. In a single transaction. It’s churning its little heart out, but happily it has a solid-state drive so I don’t have to listen to it weep.

It looks frozen. Wait a moment! I just got another line of output! Maybe I need to break this up somehow. Or I could just cheat (because I know that all of the nodes that are currently in state foo need to have bar done to them in this particular import).
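
One way to "break this up", sketched here rather than lifted from the real import, is to commit in fixed-size batches instead of one giant transaction. Everything in this example is an assumption made for illustration: it uses Python's built-in sqlite3 as a stand-in for Postgres, and the table name `name_import`, the `batched` helper and the batch size are invented.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(conn, rows, size=1000):
    """Insert rows, committing after every batch. Returns the number of commits."""
    commits = 0
    for chunk in batched(rows, size):
        conn.executemany(
            "INSERT INTO name_import (id, name) VALUES (?, ?)", chunk
        )
        conn.commit()  # each batch is its own transaction
        commits += 1
    return commits


# Hypothetical stand-in table and data, not the real APNI schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_import (id INTEGER PRIMARY KEY, name TEXT)")
rows = ((i, f"name-{i}") for i in range(2500))
commits = load_in_batches(conn, rows, size=1000)
total = conn.execute("SELECT COUNT(*) FROM name_import").fetchone()[0]
```

The trade-off is that a failure part way through leaves the earlier batches committed, so the import has to be safe to resume or re-run.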

9PM, I am going home.

It has been some time

October 2, 2014

It has been a while since I blogged.

The National Species List team is now proceeding with development, with a focus on replacing the existing APNI Oracle Forms app. Data-wise, there are two important components.

The ‘names’ component is the main focus of APNI, and the one that grapples with the whole question of how taxonomic names are used in scientific publications. The basic idea is first, that whenever a name is used, it’s always used somewhere, and the author who uses that name is almost always citing it from somewhere else – this is simply how the whole process of scholarship works. Second, when a document contains more than one name, they are usually in some sort of relationship to each other at the place they appear.

Thus name ∩ reference ≡ instance. An instance is (almost always) a citation of some other instance, and it has other instances that it cites in the same place (obviously, this is done as a one to many cited_by value on the target). Of course, these relationships have many, many, many different types and sorting through what they are is a challenge for the scientists. Thus a paper may talk about a name, state that another name used elsewhere is a misspelling, and that a further name used somewhere else is a synonym.
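
To make that a bit more concrete, here is a minimal sketch of how I read the instance idea. It is not the actual NSL schema: only `cited_by` is a name used above, while the `cites` and `relationship` fields, the `cited_in` helper, and all of the names and references are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Instance:
    """A name as it is used in one particular reference."""
    name: str
    reference: str
    relationship: Optional[str] = None      # e.g. "synonym", "misspelling"
    cites: Optional["Instance"] = None      # the usage elsewhere being cited
    cited_by: Optional["Instance"] = None   # the instance in the same place that cites this one


def cited_in(primary: Instance, instances: List[Instance]) -> List[Instance]:
    """The one-to-many side: every instance whose cited_by is `primary`."""
    return [i for i in instances if i.cited_by is primary]


# Hypothetical data: a paper discusses "Aus bus", calls "Aus buss" a
# misspelling and "Aus cus" a synonym, each citing a usage elsewhere.
earlier_misspelling = Instance("Aus buss", "Jones 1990")
earlier_synonym = Instance("Aus cus", "Brown 1985")
primary = Instance("Aus bus", "Smith 2010")
misspelling = Instance("Aus buss", "Smith 2010", "misspelling",
                       cites=earlier_misspelling, cited_by=primary)
synonym = Instance("Aus cus", "Smith 2010", "synonym",
                   cites=earlier_synonym, cited_by=primary)

all_instances = [earlier_misspelling, earlier_synonym, primary, misspelling, synonym]
related = [(i.name, i.relationship) for i in cited_in(primary, all_instances)]
```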

The thing to note is that this component mainly stores facts about published names – Dr So-and-so did indeed publish a paper where he used certain names and said certain things about them.

Most of the work done so far has been on this component and on a rather nice web app to replace APNI.

The other component is the tree component – my bit. My main focus over the past week or two has been getting the APC classification into it. This is nowhere near as straightforward as you might suppose.

The APC data is in a table, the pertinent bits being “this id was an apc concept from this date to that date, and it was placed under this other id”.

I have dealt with the issue where the other id does not have a date range that matches the ranges of its sub-ids. This happens a lot, owing to the history of APC. Botanists, it seems, really don’t care much about higher classification above family. So Genus A got declared as being put under family F, then at a later date family F was added to APC (with no declared supertaxon), then at a later date still, once the higher classification was assembled (phylum Ph, class C, order O), F was put into order O where it belongs.

In the data, it looks roughly like this:

Taxon  Supertaxon  From  To
G      F           1998  ongoing
F      (none)      1999  2002
C      Ph          2000  ongoing
O      C           2000  ongoing
F      O           2003  ongoing

My system deals with this by creating “declared supertaxon” nodes for ids that are declared as being supertaxa over time ranges where those supertaxa are not APC concepts. That way, if five genera were put in a family that wasn’t in APC until later, that information is captured. You can see the APC tree grow over time – I should post some screenshots.
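
A rough sketch of the gap-finding part of that, using the toy table above. This is not the real import code: the years are treated as inclusive, `ONGOING` is a made-up sentinel for an open-ended range, and the function names are invented for the example.

```python
ONGOING = 9999  # assumed sentinel for "ongoing"

# (taxon, supertaxon, first_year, last_year), years inclusive
RECORDS = [
    ("G", "F", 1998, ONGOING),
    ("F", None, 1999, 2002),
    ("C", "Ph", 2000, ONGOING),
    ("O", "C", 2000, ONGOING),
    ("F", "O", 2003, ONGOING),
]


def uncovered(start, end, covers):
    """Return the parts of [start, end] not covered by any (lo, hi) in covers."""
    gaps = []
    cursor = start
    for lo, hi in sorted(covers):
        if hi < cursor:
            continue              # this range ends before the part we still need
        if lo > end:
            break                 # this range starts after the placement ends
        if lo > cursor:
            gaps.append((cursor, lo - 1))
        cursor = max(cursor, hi + 1)
        if cursor > end:
            break
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps


def declared_supertaxon_ranges(records):
    """Map each supertaxon to the ranges where it is declared but is not a concept itself."""
    in_apc = {}
    for taxon, _, lo, hi in records:
        in_apc.setdefault(taxon, []).append((lo, hi))
    result = {}
    for _, parent, lo, hi in records:
        if parent is None:
            continue
        gaps = uncovered(lo, hi, in_apc.get(parent, []))
        if gaps:
            result.setdefault(parent, []).extend(gaps)
    return result


declared = declared_supertaxon_ranges(RECORDS)
```

On the toy data this flags F for 1998 (G was placed under it a year before F itself was added) and Ph for the whole of its range, because Ph never gets a row of its own in the table. The sketch does not merge overlapping gaps when several taxa declare the same missing supertaxon.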

This isn’t my current problem.

My current problem is this:

Many APC records declare ids that are not APC instances, but the names on those instances are the same as some other instance that is in APC. This happens whenever someone went “we are not using Dr A’s publication for genus G anymore, we are using Dr B’s.” Of course, all the species (S*) in genus G still declare that Dr A’s version of G is their supertaxon.

So the final step of my import is “After loading everything, make a change using right now as the date mark. Anything that has a declared supertaxon that isn’t in APC, where there is a supertaxon that is in APC, should be moved there.”

And this blew up, blew up, blew up. Eventually I found the problem. You see, four or five APC records declare that their supertaxon is the same as the taxon. I ignore that, no probs. But there are also some that declare a supertaxon with the same name as themselves. Which hadn’t been a problem until now. But this new final step attempts to move those taxa to be under themselves.

The nice thing is that my tree manipulation layer catches this nonsense and throws an exception. That is – we don’t wind up with screwed-up data in the final tables. The versioning algorithm says “WTF?” and correctly throws an exception. Which, as a certain colleague of mine would say, is quite pleasing.
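
The kind of guard I mean looks roughly like this. It is a toy illustration, not the real tree manipulation layer: the `Tree` class, `TreeError` and `try_move` are all invented for the example, and there is no versioning in it at all.

```python
class TreeError(Exception):
    """Raised when an edit would leave the tree in a nonsensical state."""


class Tree:
    def __init__(self):
        self.parent = {}  # node -> parent node (None for the root)

    def add(self, node, parent=None):
        self.parent[node] = parent

    def is_self_or_descendant(self, node, ancestor):
        """True if `node` is `ancestor` or sits somewhere below it."""
        while node is not None:
            if node == ancestor:
                return True
            node = self.parent[node]
        return False

    def move(self, node, new_parent):
        # Moving a node under itself, or under one of its own descendants,
        # would detach it from the root and create a loop.
        if self.is_self_or_descendant(new_parent, node):
            raise TreeError(f"cannot move {node} under {new_parent}")
        self.parent[node] = new_parent


def try_move(tree, node, new_parent):
    """Attempt a move and report what happened instead of propagating the error."""
    try:
        tree.move(node, new_parent)
        return "ok"
    except TreeError:
        return "rejected"


tree = Tree()
tree.add("root")
tree.add("G", "root")
tree.add("S1", "G")
tree.add("H", "root")

outcomes = [
    try_move(tree, "G", "G"),    # a taxon under itself
    try_move(tree, "G", "S1"),   # a taxon under its own subtaxon
    try_move(tree, "S1", "H"),   # an ordinary move
]
```

Because the check runs before anything is written, a rejected move leaves the parent map exactly as it was.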

So, I wrote a fix. It ran against a test. It didn’t run against the full load. Is this because my fix doesn’t work, or because there is some other even more devious and subtle problem in the data? (UPDATE: more devious and subtle, it seems.)

Here’s an example:

Blue arrows = ‘this node was copied to that node’, a version branch.
Red arrows = ‘this node was replaced by that node’, a version merge.
Black/grey lines = subtaxa
black = current, grey = no longer current
rectangle = taxon
hexagon = unnamed classification root
oval = none of the above. There is one of these: a tag I attach to the APC import itself, prior to the final mixup operations. This tag doesn’t have a name, but there is a comment on it.
flag = a named tag
no arrowhead = a versioning link
dot arrowhead = a tracking link
anchor arrowhead = a fixed link

APC concept 39864 is for apni instance 242777 with name 309388. It declares its supertaxon to be apni instance 242775 which also has name 309388. My system responds by moving all subtaxa of 242775 to 242777 except for 242777 itself. This is moved up to the root of the APC classification. This leaves 242775 dangling there by itself – but that’s not a problem big enough to worry about.

This diagram also shows a final tree operation in which no changes were made (purely because this is only a partial data set). You can see that operation in the move from 4965798:4965842 to 4965798:4965856.

In any case. I printed out the 8 problematic records and gave them to one of the scientists here. She was like “What the hell? Oh I see – the Genus isn’t in APC.” So in the long term, these problems simply will be fixed.


March 17, 2014

So I attempted to explain what I had in mind at work today, and managed to determine that my ideas are pretty half-baked. The more I think about it, the more it’s clear that the new model needs to do everything that the tree model does.

The tree idea is pretty mature at this stage, and does almost everything. What it doesn’t do is the main take-away of the list idea: for the current version of a tree, every item is numbered in sequence. You can get all the Pinaceae by finding the taxon, asking it what row range it covers, and then selecting and ordering by row number over that range.
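
As a small sketch of the numbering idea (in-memory Python, not the database version, with an invented `number_rows` function and a tiny made-up tree): walk the tree in order, give every item the next row number, and remember the first and last row each subtree covers.

```python
def number_rows(tree, root):
    """Preorder walk. Returns ({node: (first_row, last_row)}, [node per row])."""
    ranges, rows = {}, []

    def walk(node):
        first = len(rows)
        rows.append(node)                    # this node takes the next row
        for child in tree.get(node, []):     # children in their definite order
            walk(child)
        ranges[node] = (first, len(rows) - 1)

    walk(root)
    return ranges, rows


# Illustrative tree only: node -> ordered list of subnodes.
TREE = {
    "conifers": ["Pinaceae", "Cupressaceae"],
    "Pinaceae": ["Pinus", "Abies"],
    "Cupressaceae": ["Callitris"],
}

ranges, rows = number_rows(TREE, "conifers")
first, last = ranges["Pinaceae"]
pinaceae_rows = rows[first:last + 1]  # the "select by row range" step
```

Here Pinaceae covers rows 1 to 3, so the slice returns Pinaceae, Pinus and Abies without any tree walking.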

Can this notion be back-fitted onto the tree idea? Yes it can.

First, the subnodes of a node need to have a definite order.

Next, the notion of “a checklist” corresponds to “a tree root”. This is to say, I need to distinguish between the whole history of a tree, and a particular point in that history. I’m thinking of calling them ‘tree’ and ‘checklist’. ‘Tree’ becomes a partition of the data set, to which things like permissions and user groups might be attached.

Each node, in addition to belonging to a ‘tree’, belongs to some particular checklist and has a sequence within that checklist. This needs to get updated when a versioning is performed. Basically, these tree roots are much more explicitly managed by the system.

Old checklists still need to be tree-walked, as we are doing now. There’s no easy way to get item 50 in checklist 9 once checklist 9 has been replaced. But although there’s no easy way of doing it, it can still be done with a recursive query which can zip down the tree to exactly the right spot. The trick is to note the offset between where checklist 9 says the node is and where the current checklist (checklist 11) says it is, and to carry that offset down the tree walk. To tell the truth, I suspect it will work rather well, and it will work far better than the current model which just has the nodes floating around in space.

Yep – retain the existing versioning model, which I am confident about, and add a notion of ‘absolute position in the current checklist’. Keeping that correct will be an interesting and important addition to the existing update queries, but it probably won’t need a load of completely new stuff. One extra table to explicitly hold the checklist roots, and some new fields to hold numbers.

(note to self – store the depth as well. Makes it easy to produce an indented list with just the select by row range.)
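
To show why the stored depth helps, here is a sketch that turns a row range straight into an indented list. The `(row, depth, name)` tuple layout, the `indented` function and the data are assumptions for the example, not the real table.

```python
# (row, depth, name) for a tiny illustrative checklist
ROWS = [
    (0, 0, "conifers"),
    (1, 1, "Pinaceae"),
    (2, 2, "Pinus"),
    (3, 2, "Abies"),
    (4, 1, "Cupressaceae"),
    (5, 2, "Callitris"),
]


def indented(rows, first, last):
    """Select rows first..last (inclusive) and indent each by its relative depth."""
    selected = sorted(r for r in rows if first <= r[0] <= last)
    if not selected:
        return ""
    base = selected[0][1]  # depth of the first row in the range
    return "\n".join("  " * (depth - base) + name for _, depth, name in selected)


listing = indented(ROWS, 1, 3)
```

No parent pointers are followed at all: the row order gives the sequence and the depth gives the indent.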


March 15, 2014


We have a table of the checklists. A “checklist” is a version at a particular time, so there are multiple records for ‘APC’ but only one of these will be the current APC checklist, with a null replaced_by.

id               Checklist id
copy_of_id       Checklist history pointers
persisted_at_ts  Checklist history timestamps
(other)          Other data fields. Labels, owner, long title, uri, etc.

We have two tables for the items in the checklists: checklist_p and checklist_v, p=’physical’ and v=’virtual’.

Checklist_p always contains the current version of every checklist. In particular, checklist_p contains the taxon ids. The data structure is

id          The physical id
cl_id       The checklist id
row         The row number relative to the checklist (starting at row 0).
parent_row  Tree structure in doubly-linked grapevine format. I’m thinking about using offsets rather than numbers for reasons which shall be explained.
            (I might make these names shorter)
(other)     Taxon id and other data items

Note that this makes it straightforward to read out current lists in row order, and to check that current lists do not have duplicate taxa. If we slap a unique constraint on cl_id/taxon_uri, then job done.
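
A quick sketch of that constraint doing its job. This uses Python's built-in sqlite3 as a stand-in for Postgres, with a cut-down table; the column names `row_no` and `taxon_id` and the `add_row` helper are assumptions for the example, not the final schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE checklist_p (
        id        INTEGER PRIMARY KEY,
        cl_id     INTEGER NOT NULL,
        row_no    INTEGER NOT NULL,
        taxon_id  INTEGER,
        UNIQUE (cl_id, taxon_id)   -- no taxon twice in the same checklist
    )
""")


def add_row(cl_id, row_no, taxon_id):
    """Try to insert one row. Returns False if the unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO checklist_p (cl_id, row_no, taxon_id) VALUES (?, ?, ?)",
                (cl_id, row_no, taxon_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    add_row(1, 0, None),  # row zero: null taxon, allowed
    add_row(1, 1, 42),
    add_row(1, 2, 42),    # same taxon again in checklist 1: rejected
    add_row(2, 0, None),  # another checklist's row zero
    add_row(2, 1, 42),    # same taxon in a different checklist: fine
]
```

The null taxon on row zero does not trip the constraint, because NULLs count as distinct from one another in a unique constraint here.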

Row number zero has a null taxon_id, it’s used so that a checklist may contain multiple items at the top level.

Checklist_v contains the arrangement of all checklists. It is similar in structure to checklist_p – a grapevine list using row numbers. Each item in the list refers to another list item either in checklist_v or checklist_p. Each item in the list either uses the arrangement of sub nodes from the thing referred to, or has its arrangement in checklist_v.

This means that checklist_v, while being structured like checklist_p has entries missing – they are located elsewhere. You can see why I am thinking that storing the tree structure as offsets might be the go. It means that when you chase a reference, you don’t have to keep track of an adjustment to the row numbers referred to.
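
Here is a small sketch of why offsets are attractive. If each item stores "my parent is N rows above me" rather than an absolute row number, a fragment can sit at any position without its links being rewritten. The `parent_offset` field and the two functions are invented for illustration.

```python
# A fragment: one family and two genera, parents stored as relative offsets.
FRAGMENT = [
    {"name": "Pinaceae", "parent_offset": None},  # top of the fragment
    {"name": "Pinus", "parent_offset": -1},       # parent is 1 row up
    {"name": "Abies", "parent_offset": -2},       # parent is 2 rows up
]


def place(fragment, base_row):
    """Copy a fragment so that it starts at base_row. Offsets are left untouched."""
    return {base_row + i: dict(item) for i, item in enumerate(fragment)}


def parent_row(rows, row):
    """Resolve a row's parent from its stored offset (None at the top)."""
    offset = rows[row]["parent_offset"]
    return None if offset is None else row + offset


at_10 = place(FRAGMENT, 10)
at_500 = place(FRAGMENT, 500)

abies_parent_at_10 = parent_row(at_10, 12)
abies_parent_at_500 = parent_row(at_500, 502)
```

The same three records resolve correctly at row 10 and at row 500, which is the "no adjustment to keep track of" property.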

My main concern is whether or not checklist_v should be permitted to refer to itself. The problem with this is that it means that reconstructing old checklists becomes a tree-walk of arbitrary depth. The problem with *not* doing it is that if a node is rearranged so that it no longer appears in checklist_p, then the node as it was needs to be reconstructed in checklist_v everywhere that it appears. In effect, checklist_v fills up with crud that we don’t really care about anymore but we have to keep.

But it’s not as bad as it appears – nodes get moved (more on this in a minute), so checklist_v self-references don’t follow a history, they point to the latest place where the node appears in that form. The very cool thing about this is that we can show that on a printout of an old, reconstructed list.

To continue.

The basic job of editing is this:

You construct a new checklist made up of bits of other checklists and new data which you enter. You can opt to push an entry in your list into another (current) list – replacing a node in that list with a node in yours. Note that inserts and deletes are done by replacing the parent node of the node you are trying to insert.

When you do this, a new checklist is created. The nodes from your list and the unaltered nodes from the target list are moved into the new list and replaced with references.

And, well, and that’s it. (TODO: what happens to the NSL list? it has to be left in permanent “I am not a current list” state, or something.)

No, actually, that’s not it. The other issue is where two people are hammering away at the same list. A replacement of a node at high level means that the checkout needs to be turned into a checkout of the new node. Unaltered sibling nodes need to be fixed up …. I still need the notion of a ‘tracking’ link. Which is kind of comforting. This system is probably isomorphic to my git-like nodes system, just redone with row numbers to make working with the current list much easier.

A worked example

Actually – no. I’ll start coding, then once that’s done do an update and then dump the results. Which is how those graphs were done.

March 13, 2014

A checkin is ‘copy this list fragment from that list over the top of this fragment’.

If a bit is unchanged as a result of the checkin (i.e., everything else), then conceptually it gets moved to the new list and replaced with a reference. Cool. Now, if the physical id is unchanged, this means that physical ids will always point to the latest occurrence of the item. The item needs to be updated to say “I belong to this list, now”. A node should be marked, then, with the place where it was first made. This way, we can say that a list contains Polychaeta as per AFD timestamp-to-timestamp.

