I worked at defense with a bloke named Russel Clinch – a very, very fine team leader, architect and all-around competent professional. I remember one session in particular. We spent most of a morning deciding what we were going to name certain classes. That’s all – just the names of the classes and what they did. And we both regarded it absolutely as effort well-spent.
Currently, one of our upcoming tasks will be to build a system allowing people to manage taxonomic trees. One of the things this system will do is to expose the results as RDF. The problem is – what do we call the nodes?
Obviously, the first response is “the nodes are taxa, duh”. Well, there are a couple of problems with this.
What is a taxon? Or TaxonConcept, in RDF-speak? It’s somebody’s idea of a meaningful, useful way to split out a subset of living things. In our world, a ‘taxon’ is usually a formally published idea of such a subset. There are a number of ways of specifying a taxon: describing a species, naming a type specimen, declaring a list of synonyms, I suppose gene sequences.
Thing is, one person may take genus A and organize it like so:
- A a
- A b
- A c
And another person like so:
- A (a)
- A (a) a
- A (a) b
- A c
- A (a)
When this happens, they are talking about the same taxon A. It’s the same name, the same published description, the same subset of all possible living things. They have just split it up differently internally. But the two A’s are different tree nodes. So I need to call a tree node something other than ‘taxon’.
So what should I call them? Well, ‘Tree’ and ‘TreeNode’ or just ‘Node’. But I really dislike using these very very generic names, because they are names for computing artifacts, not names for what is happening in the subject-matter space. (also, the trees are not strictly tree-like: we have hybrids and whatnot)
Now, taxa are arranged into trees by taxa being placed under other taxa. So I think a much better naming convention would be to call our data items each a Placement. A set of (related) placements is an Arrangement. Some arrangements are very important – large, identified by a well-known name, managed by someone and so on. These arrangements are each a Classification.
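To make the naming concrete, here is a minimal sketch of the three terms as Python dataclasses. It is only an illustration under assumptions: the field names (`parent`, `child`, `name`, `manager`) and the `urn:example:` identifiers are made up for this example, not taken from any real system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Placement:
    """One taxon placed under another. Taxa are referred to only by identifier."""
    parent: str  # hypothetical identifier of the taxon being placed under
    child: str   # hypothetical identifier of the taxon being placed


@dataclass
class Arrangement:
    """A set of (related) placements."""
    placements: set = field(default_factory=set)

    def place(self, parent: str, child: str) -> Placement:
        placement = Placement(parent, child)
        self.placements.add(placement)
        return placement

    def children_of(self, taxon: str) -> set:
        return {p.child for p in self.placements if p.parent == taxon}


@dataclass
class Classification(Arrangement):
    """An arrangement important enough to have a well-known name and a manager."""
    name: str = ""
    manager: str = ""


A = "urn:example:A"

# First person's view of genus A: three species directly under the genus.
first = Arrangement()
for species in ("urn:example:A-a", "urn:example:A-b", "urn:example:A-c"):
    first.place(A, species)

# Second person's view: a and b sit under a subgenus (a), c stays under A.
second = Classification(name="Some well-known classification")
second.place(A, "urn:example:A-(a)")
second.place("urn:example:A-(a)", "urn:example:A-(a)-a")
second.place("urn:example:A-(a)", "urn:example:A-(a)-b")
second.place(A, "urn:example:A-c")
```

Both arrangements talk about the same taxon `A`, yet `children_of(A)` gives a different answer in each, which is exactly why the thing being arranged and the arranging need different names.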
Now, my question becomes – does a Placement correspond to a node in our tree? There are two problems with this, exemplified by hybrids:
A --isfemaleparentof-> A × B; and
B --ismaleparentof-> A × B .
Our two problems are that there are two placements for this tree node, and that the placements must be decorated with data – a relationship type.
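A small sketch of what that looks like if a placement is treated as a labelled arc rather than as a node. The class and field names here, and the `urn:example:` identifiers, are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """A labelled arc: from_taxon --relationship-> to_taxon."""
    from_taxon: str
    relationship: str
    to_taxon: str


HYBRID = "urn:example:AxB"

placements = {
    Placement("urn:example:A", "isFemaleParentOf", HYBRID),
    Placement("urn:example:B", "isMaleParentOf", HYBRID),
}


def placements_into(taxon, arcs):
    """All placements that end at the given taxon, in a stable order."""
    return sorted((p for p in arcs if p.to_taxon == taxon),
                  key=lambda p: p.from_taxon)


parents = placements_into(HYBRID, placements)
```

The hybrid ends up with two incoming placements, each carrying its own relationship type, so a one-to-one match between placement and tree node cannot hold.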
So, ok – an arrangement is primarily a set of arcs. But what data goes in the nodes? After a quick think, I am persuaded: nothing at all! That is, nothing that isn’t already in the Taxon Concept, which is stored as part of the reference [1].
That is, a Placement is a subclass of TDWG Relationship, which isPartOf some Arrangement. An Arrangement is a Set of Placement. Classification is a subclass of Arrangement. The computer system that permits working with these arrangements does not need to ‘do’ taxa at all – it can simply refer to them by their URI. Of course in a working system we will probably need a table of some sort for the nodes, but conceptually it’s not really a thing that’s needed.
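As a rough sketch of how that might come out as triples, the following builds them as plain tuples and prints them in N-Triples form. Everything under `http://example.org/` is a hypothetical placeholder, including the stand-in for the TDWG Relationship class (I am not quoting the real TDWG vocabulary here); only the `rdf:type` and `rdfs:subClassOf` URIs are the standard ones.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/arrangement#"  # hypothetical vocabulary namespace

# Class-level statements. "TdwgRelationshipPlaceholder" stands in for the
# real TDWG Relationship class, whose URI is not given in this post.
schema = [
    (EX + "Placement", RDFS_SUBCLASS_OF, EX + "TdwgRelationshipPlaceholder"),
    (EX + "Classification", RDFS_SUBCLASS_OF, EX + "Arrangement"),
]


def placement_triples(placement_uri, arrangement_uri, from_taxon, to_taxon):
    """Triples for one placement. Taxa appear only as URIs, never as data."""
    return [
        (placement_uri, RDF_TYPE, EX + "Placement"),
        (placement_uri, EX + "isPartOf", arrangement_uri),
        (placement_uri, EX + "fromTaxon", from_taxon),
        (placement_uri, EX + "toTaxon", to_taxon),
    ]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI tuples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = placement_triples(
    "http://example.org/placement/1",
    "http://example.org/arrangement/1",
    "http://example.org/taxon/A",
    "http://example.org/taxon/A-a",
)
print(to_ntriples(schema + triples))
```

Nothing in the output describes a taxon; the system only ever points at one.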
Of course, there is a fly in the ointment.
In the actual system we are thinking of building at some point in the future (it’d be nice to be able to get started before end of FY, but that’s not really my department) we will probably need to create anonymous nodes in our trees. You know, for speculative purposes, work that’s just ‘playing around’.
… actually, forget what I was going to say. I was going to talk about a ‘one level deep’ arrangement being a suitable end-point for a toTaxon or fromTaxon predicate, but the whole world will be way, way simpler if we just pretend that these objects are TaxonConcepts. End of that discussion.
My only problem now is that we want, in our system, to be able to say that an arrangement is made up of bits of other arrangements … and dammit if that isn’t going to mess up absolutely everything I have just said. I absolutely do need to be able to refer to tree nodes.
Ok. So having moved everything important into the arcs, what do I make of these tree node items? Well, a tree node is nothing but a set of subnodes. It refers to these subnodes by a set of arcs whose “from” node is the tree node. But a set of arcs is an Arrangement. So it seems that we have a special kind of ‘one layer deep’ arrangement object. Call it a … goddammit I’ll call it a ‘node’ for now. It makes sense to say ‘a node includes itself and any nodes included by its subnodes’.
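That ‘includes’ rule is just a recursive walk over the arcs. A minimal sketch, assuming arcs are plain `(from_node, to_node)` pairs and using the genus A example from earlier as data; the function names are made up for illustration.

```python
def subnodes(node, arcs):
    """The 'one layer deep' view: nodes reached by arcs whose from-node is `node`."""
    return {to_node for from_node, to_node in arcs if from_node == node}


def included(node, arcs, seen=None):
    """A node includes itself and any nodes included by its subnodes.

    The `seen` set stops a node being visited twice, which matters because
    hybrids mean the structure is not strictly a tree.
    """
    seen = set() if seen is None else seen
    if node in seen:
        return seen
    seen.add(node)
    for sub in subnodes(node, arcs):
        included(sub, arcs, seen)
    return seen


arcs = {
    ("A", "A (a)"),
    ("A (a)", "A (a) a"),
    ("A (a)", "A (a) b"),
    ("A", "A c"),
}

subgenus = included("A (a)", arcs)
whole_genus = included("A", arcs)
```

Here `subgenus` holds the subgenus and its two species, while `whole_genus` holds all five nodes.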
Drat, I am going to have to rethink this. One way to handle it would be to say that a placement may be part of many Arrangements, but that would mean recursively adding things. Another would be to just pretend, for purposes of RDF, that the nodes don’t exist unless they are used in other arrangements, but that breaks the link to the name and changes types in uneven ways when this is done.
Perhaps what I need to do is have a special kind of placement which means ‘from this point down, use that placement over there belonging to some other arrangement’. No, actually it would just need to say ‘use some other arrangement from here’.
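Here is one way that idea might look, purely as a sketch of the thought and not a design. The `use_from` mapping (node to the other arrangement to use from that node down) and the `walk` function are names I am inventing for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Arrangement:
    arcs: set = field(default_factory=set)        # (from_node, to_node) pairs
    use_from: dict = field(default_factory=dict)  # node -> Arrangement to use from here


def walk(arrangement, node):
    """Yield every arc below `node`, switching arrangement where a link says so."""
    current = arrangement.use_from.get(node, arrangement)
    for from_node, to_node in sorted(current.arcs):
        if from_node == node:
            yield (from_node, to_node)
            yield from walk(current, to_node)


# An existing arrangement of genus A.
base = Arrangement(arcs={
    ("A", "A (a)"),
    ("A (a)", "A (a) a"),
    ("A (a)", "A (a) b"),
    ("A", "A c"),
})

# A new arrangement that places A itself, then borrows everything below A.
draft = Arrangement(arcs={("Root", "A")}, use_from={"A": base})

below_root = list(walk(draft, "Root"))
```

The draft stores a single arc of its own and still reaches the whole of genus A, with nothing copied down recursively. This sketch does not guard against two arrangements that link to each other in a loop.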
Drat – a big problem is that I want to be able to edit these things and track the edits. If I model this as each edit being an arrangement in its own right, then I run back into the problem of having to recursively copy things down. Mind you – that’s useful data and it’s something we might want to index. It does affect my current thinking on how the editing process should be done … but it affects it in ways that make better sense than what I am thinking of doing now.
Look – I’ll tell you what. I’ll stop typing at this point. I still think calling ’em ‘arrangement’ instead of ‘tree’, ‘placement’ instead of ‘tree link’, and understanding that a ‘tree node’ (ugh) is not the same thing as a ‘taxon’ is a good idea.
[1] “reference”, in this context, means “a publication you refer to”. Think “the reference section at the local library”. We are not using the term in the sense of a memory address or a handle in a computer system.