March 17, 2014

So I attempted to explain what I had in mind at work today, and managed to determine that my ideas are pretty half-baked. The more I think about it, the more it’s clear that the new model needs to do everything that the tree model does.

The tree idea is pretty mature at this stage, and does almost everything. What it doesn’t do is the main take-away of the list idea: for the current version of a tree, every item is numbered in sequence. You can get all the Pinacea by finding the taxon, asking it what row range it covers, and then selecting and ordering by row number over that range.

Can this notion be back-fitted onto the tree idea? Yes it can.

First, the subnodes of a node need to have a definite order.

Next, the notion of “a checklist” corresponds to “a tree root”. This is to say, I need to distinguish between the whole history of a tree, and a particular point in that history. I’m thinking of calling them ‘tree’ and ‘checklist’. ‘Tree’ becomes a partition of the data set, to which things like permissions and user groups might be attacked.

Each node, in addition to belonging to a ‘tree’, belongs to some particular checklist and has a sequence within that checklist. This needs to get updated when a versioning is performed. Basically, these tree roots are much more explicitly managed by the system.

Old checklists still need to be tree-walked, as we are doing now. There’s no easy way to get item 50 in checklist 9 once checklist 9 has been replaced. But although there’s no easy way of doing it, it can still be done with a recursive query which can zip down the tree to exactly the right spot. The trick is to note the offset between where checklist 9 says the node is and where the current checklist (checklist 11) says it is, and to carry that offset down the tree walk. To tell the truth, I suspect it will work rather well, and it will work far better than the current model which just has the nodes floating around in space.

Yep – retain the existing versioning model, which I am confident about, and add a notion of ‘absolute position in the current checklist’. Keeping that correct will be an interesting and important addition to the existing update queries, but it probably wont need a load of completely new stuff. One extra table to explicitly hold the checklist roots, and some new fields to hold numbers.

(note to self – store the depth as well. Makes it easy to produce an indented list with just the select by row range.)


March 15, 2014


We have a table of the checklists. A “checklist” is a version at a particular time, so there are multiple records for ‘APC’ but only one of these will be the current APC checklist, with a null replaced_by.

id Checklist id
copy_of_id Checklist history pointers
persisted_at_ts Checklist history timestamps
Other data fields. Labels, owner, long title, uri, etc.

We have two tables for the items in the checklists: checklist_p and checklist_v, p=’physical’ and v=’virtual’.

Checklist_p always contains the current version of every checklist. In particular, checklist_p contains the taxon ids. The data structure is

id The physical id
cl_id The checklist id
row The row number relative to the checklist (starting at row 0).
parent_row Tree structure in doubly-linked grapevine format. I’m thinking about using offsets rather than numbers for reasons which shall be explained.
(I might make these names shorter)
Taxon id and other data items

Note that this makes it straightforward to read out current lists in row order, and to check that current lists do not have duplicate taxa. If we slap a unique constraint on cli_id/taxon_uri, then job done.

Row number zero has a null taxon_id, it’s used so that a checklist may contain multiple items at the top level.

Checklist_v contains the arrangement of all checklists. It is similar in structure to checklist_p – a grapevine list using row numbers. Each item in the list refers to another list item either in checklist_v or checklist_p. Each item in the list either uses the arrangement of sub nodes from the thing referred to, or has its arrangement in checklist_v.

This means that checklist_v, while being structured like checklist_p has entries missing – they are located elsewhere. You can see why I am thinking that storing the tree structure as offsets might be the go. It means that when you chase a reference, you don’t have to keep track of an adjustment to the row numbers referred to.

My main concern is whether or not checklist_v should be permitted to refer to itself. The problem with this is that it means that reconstructing old checklists becomes a tree-walk of arbitrary depth. The problem with *not* doing it is that if a node is rearranged so that it no longer appears in checklist_p, then the node as it was needs to be reconstructed in checklist_v everywhere that it appears. In effect, checklist_v fills up with crud that we don’t really care about anymore but we have to keep.

But it’s not as bad as it appears – nodes get moved (more on this in a minute), so checklist_v self-references don’t follow a history, they point to the latest place where the node appears in that form. The very cool thing about this is that we can show that on a printout of an old, reconstructed list.

To continue.

The basic job of editing is this:

You construct a new checklist made up of bits of other checklists and new data which you enter. You can opt to push an entry in your list into another (current) list – replacing a node in that list with a node in yours. Note that inserts and deletes are done by replacing the parent node of the node you are trying to insert.

When you do this, a new checklist is created. The nodes from your list and the unaltered nodes from the target list are moved into the new list and replaced with references.

And, well, and that’s it. (TODO: what happens to the NSL list? it has to be left in permanent “I am not a current list” state, or something.)

No, actually, that’s not it. The other issue is where two people are hammering away at the same list. A replacement of a node at high level means that the checkout needs to be turned into a checkout of the new node. Unaltered sibling nodes need to be fixed up …. I still need the notion of a ‘tracking’ link. Which is kind of comforting. This system is probably isomorphic to my git-like nodes system, just redone with row numbers to make working with the current list much easier.

A worked example

Actually – no. I’ll start coding, then once that’s done do an update and then dump the results. Which is how those graphs were done.

March 13, 2014

A checkin is ‘copy this list fragment from that list over the top of this fragment’.

If a bit is unchanged as a result of the checkin (ie, everything else), then conceptually it gets moved to the new list and replaced with a reference. Cool. Now, if the physical I’d is unchanged, this means that physical ids will always point to the latest occurrence of the item. The item needs to be usdataed to say “I belong to this list, now”. A node should be marked, then, with the place where it was first made. This way, we can say that a list contains Polychaeta as per AFD timestamp-to-timestamp.

March 13, 2014

List data structure should be breadth-first. When a list includes a sublist, we are only talking including a range of the highest-level nodes.

But this makes it difficult to pull the list out with a simple query. Perhaps not.


March 13, 2014

Back-fitting becomes a non-issue if we agree that references to ranges in other lists are references to the entire, virtual expanded list.
Checkins become simple if a given list only contains a taxon id once. Conceptually, replace every taxon in the old list with the single taxon in the new list, then collapse ranges (this won’t preserve placement structure)
A list only uses ranges from the list it is replaced-by. So you do need to do a tree walk to reconstruct an old list. But this is ok.
A checkout involves copying an entire list. Need to make this work with fragments.
Whenever a list item is a taxon rather than a reference to some other list fragment, then that’s what gets assigned a persistent id.


March 13, 2014

My boss suggested something, and now I am re-thinking how to do the whole tree thing

1 – represent the tree as an indented list
2 – list items may be borrowed from other lists. This is key, because we don’t want zillions of copies of all possible list floating around
3 – how to do the history? Well, same as now, but a “list” is simply a node that is used in more than one place
4 – this means that a change to a genus forces all genera in the family to become their own list. We don’t want this, so list items need to be ranges pulled from other lists.
4 – this will result in a list being made up of fragments of old lists. We don’t want this, because the most important job is finding a name in the current list. So a basic job is back-fitting. When a list is replaced with a new list, all places using the list need to be adjusted so that they are using the new list (or fragments thereof) as far as possible. This may mean that a list item reduces down to a single fragment, which means that can be put into the lists that use it and the single-fragment list gets deleted.

I don’t know if this is possible.

I’ll have to think about it.

Some progresss

March 4, 2014

How the heck did we ever program before google?

Beats me.

I now have a servlet which will permit you to call stored procedures in a postgres database, passing in and getting back JSON. Simple? Well, yes. But the code underneath has to convert things between JSON strings, arrays, record sets and back again.

The goal is that the inner stored procedures should not have to talk JSON at all. There’s a layer to unpack and repack that does that.

For instance: a thing1 has many thing2s. I want to pass back an array of thing1s to the caller. So theres code to build a JSON object from a set of columns from thing2: row_to_json, to convert a result set into an array: array, to convert that array into a single json object: array_to_json, and finally to stuff that into a json object built from the columns of thing1: row_to_json.

Of course, this function is passed an array of thing1 ids, which are integers that come in as a json array. So:

  _result := thing1_array(
      select a.value::text::int 
      from json_array_elements(request_json) a

Ten years ago, the only way to get this done would be to take a few days to thoroughly read the docs. Now, it’s google and trial-and-error.

What else? A consistent framework for passing messages back to the caller, allowing the functions deep down in the call hierarchy to whinge to the user helpfully, and rolling back by throwing exceptions around. The return json *always* is one json object, with a key named ‘ok’ and one named ‘messages’ – other keys as needed depending on the operation.

Oh – I should be returning a meaningful http response code. 404 for “that record id is rubbish”, 409 when we can’t do an update as a result of complicated business rules.

Fun times. Yeah, I know there’s 9000 frameworks that do all this. I have expressed my opinion of frameworks before. My ‘framework’ is 148 lines that go like this:

  public class XxxJson extends HttpServlet {
    protected void service(…) {
      … stuff goes here
      PrintWriter out = new PrintWriter(response.getOutputStream());

Old-school. ‘Cause I’m old.


Get every new post delivered to your Inbox.

Join 70 other followers