Ever had one of those moments?


One of those “oops, this ain’t going to work” moments?

This post is stream-of-consciousness, thinking aloud. But there’s a github link at the bottom.

I’m building a system to implement a curated directed graph structure. The idea is that node are added to the graph in batches.

You have an existing graph G. You create a new set of nodes F. The rule is – nodes in F may refer to nodes in G by arcs only in one direction. I think of directionality as running downwards, because I am creating trees. F is a new tree (or set of trees) which may have nodes in G attached as leaves.

The central notion is a “versioning” operation. Every node is either “current” or is “replaced” by another node. In a versioning operation, you specify a set of current nodes (in G, unsually) and replace each with some other current node (in F). The versioning operation takes the transitive closure of the supernodes of the nodes to be replaced, generates new nodes for any of those that aren’t already being replaced, and creates the new set of nodes. Finally, all arcs of current nodes that point at a versioned node are moved to point at the new node.

That is, graph manipulation happens as a set of atomic operations (ie: one database transaction) affecting sets of nodes. At the end of the transaction, the graph is still in a consistent state.

The main motivation for this is that you don’t want your graph to hold the entire history of every little edit a user makes. Nodes go through a sequence of being draft nodes, current nodes, and then finally replaced nodes. Draft nodes are never edited, they are only ever replaced by other current nodes.

The thing is – you can replace any current nodes with any other current nodes you like, subject to an easily computed check, and the versioning algorithm will still leave the graph acyclic.

I introduced two different types of arc: a fixed arc and a versioning arc. Fixed arcs are not moved by the versioning operation. This is fine, the algorithm still works.

actually, I can stop typing now because I think I have fixed my problem. Thanks, code-review teddy bear! But I’ll go on describing it :)

-edit– no, no actually I haven’t fixed it. Dang. Still need to keep typing.

I introduced a third type of arc – a ‘tracking’ arc. This arc would be updated but would not trigger a versioning ‘up the tree’. The main use for this is in creating bookmark-like objects which have a persistent identity, but whose immediate subnodes always reflect the current state of some sequence of nodes.

While useful, this type of node link means that a versioning process can introduce loops.

Lets say node A has a subnode B. Each time B us replaced with a new version – B1, B2, node A is cloned to a new version. According to my rules, any current node may be replaced by any other current node, with the rule that you may not replace a node with a node that is going to be …

OK. Now I think I have it.

My previous rule was “you may not replace a node with any node that is going to have a new version as part of this versioning operation”. The importance of this rule is that you do not need to scan the whole graph for possible loops that might be created. Performing this check – while requiring recursive sql – is computationally feasible. In our subject matter area, where these trees have a high ‘fan out’ but are no more than a dozen or so levels deep: kingdom → phylum → class → order → family → tribe → genus → species → subspecies, each rank potentially being split into “suborders” and “supertribes”, etc . A versioning might entangle a few hundred nodes, but not tens of thousands.

I hope.

To handle tracking arcs, I am going to have to make that rule “any node that is going to be updated as part of this versioning operation”, which means nodes that are going to be versioned UNION nodes that hold tracking links to nodes that are going to be versioned.

This is a fairly small change, it won’t be expensive to compute, and will not affect the operation of the system for its intended purpose.

Code is here http://github.com/PaulMurrayCbr/RBoaTree

Note that it is NOT FINISHED YET, and DOESN’T QUITE WORK YET. I’ll tag it once I have a version that I think works.


No, goddamit, this isn’t going to work either. you may have the sequence

A -> B -> C -tracking link> D -> E

and the user tries to replace E with B. This will result in the graph

A -> B -> C -tracking link> D1 -> B

Thing is, if that’s a fixed link then this operation is ok – although it will mean that nod D1 is orphaned.

An additional issue is that I want tracking links to be updated even if the supernode is and old ‘replaced’ node. This means that tracking links always point to a current node, always reflect a current, latest state. The problem is that nodes fan out in the upward direction qiute a bit as a result of versioning. Without tracking links, this is not a problem because my ‘check for loops’ step only has to worry about current nodes. If tracking links mean that I must check the old nodes, then that’s a no-can-do: the number of nodes to worry about explodes.

Options

1 – actually, this isn’t a problem
2 – tracking links of old nodes are not updated. I don’t like this, because the state of a tracking link at the time a node was versioned is meaningless, and this might not fix the problem.
3 – tracking links are removed when a node is replaced. Easiest solution, but it means that data gets lost. In business space terms, “profile items” are only ever attached to current nodes. Or to put it another way, if you are looking at an old tree then to find the profile items for a node you have to look at whatever its current replacement is.
3a – when a node is versioned, tracking links are converted into fixed links. (isn’t this logically the same as option 2?)
4 – introduce a node type for each node and a partial ordering over those types. You may only attach a tracking link to a node further down the type hierarchy. “taxa” may track “profile items”, but not other taxa. Yuck. I want to do this without introducing having to maintain a metadata schema as part of your data.

Goddammit.

I think my rule is gong to have to be:

“You may not replace a node with a node in set K, where K is all nodes being replaced and any current supernodes linked via versioning or tracking links.”

Realistically, I could make this “any current supernode”. It does mean that certain operations are restricted that don’t need to be, but those operations don’t make any sense.

Let’s take

A -> B -FIXED> C -> D

and replace D with B. This is legal, and results in a new node

C1 -> B

It’s an orphan, but only because I am not showing the other trees that C participates in.

Will my less-restrictive rule do the job?


Ok. Ok. THIS time I think I’ve got it.

You cannot use any node that is a super node of any node that is tracking a node being versioned.

I was going to make this more permissive – limit the prohibited nodes to those specific nodes that are super nodes of each specific one, but this creates a possibility of cross linking. That is

A -tracking> AA
B -tracking> BB

and the version set {BB new node is A, AA new node is B}. This would result in the loop A->B->A … . Checking for these cross refs is way too expensive to do, so I’ll prohibit the use of A or B or any supernode of A or B as a versioning target.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: