Having looked through the neo4j manual, I am not convinced that it is a good fit for what we are trying to accomplish.
- The underlying model does not fit well into RDF.
- It is not a triple store.
- It is not based around URIs
- It does not support SPARQL out of the box – it needs add-ons
- It does not appear to support separate disk files for partitioning data
- Cypher (the neo4j query language) is not a standard in the same way that SPAQRL is
- Cypher is still being developed (although there is a mechanism for backward compatibility)
These problems can all be addressed, but they will require add-ons and work-arounds to do so.
The underlying model
Neo4j stores a graph of nodes and arcs. The nodes and arcs can be decorated with what neo4j calls ‘labels’ and ‘properties’.
Labels serve much the same purpose as RDF and OWL classes and predicates. A node may have any number of labels, each arc may have one. One of the main points about them is that one may create indexes (I get the impression that this is actually a legacy feature) on a label:property pair. You can index the ‘name’ property of every node with a ‘Taxon’ label. These indexes can be declared unique, which gives you a degree of data integrity checking (although with nothing like the rigour of an RDBMS).
Properties are simply primitive values – numbers, strings, etc. ‘Data properties’ in OWL-speak.
Property and label names are plain strings
Although labels and property names can be URIs, the Cypher language does not support this beyond allowing you to quote these kinds of identifiers (with back-quotes, of all things). It’s missing the ability to declare namespaces to be used as prefixes so far as I can see.
This means that either we put
all over the shop in the queries, or we bolt something over the top of it to supply the missing prefixes when we convert it to RDF. Or we don’t use dublin core.
Neo4j permits properties to be placed on arcs
While this is basically a great idea, it doesn’t translate into RDF. The way to do this in RDF would be to generate a rdfs:Statement object for each arc, and to attach the properties to that. This means that we require a translation layer (unless the bolt-ons on the web site do something like that).
A problem is that we would want to do this a lot – one of the things we need to do is to attach data to the arcs. Really, its a deficiency with the RDF model itself, but if we want to produce RDF at all then the question of ‘how do we present this data we have put on the arcs’ becomes a thing.
Another issue is that properties are only ever data, not arcs in themselves. One of the things we may want to do is to use a controlled vocabulary for certain properties. Enumerated types. The way we normally do this is to declare a set of URIs. We can certainly put these in strings as properties on arcs, but they wouldn’t link to nodes in the same way. In RDF, a URI is simply a URI. In SPARQL you can query for ‘nodes having a persistence type that is a shade of blue’, because ‘persistence type’ and ‘colour shade’ are nodes in their own right. But if we want arcs to have a ‘persistence type’, Neo4j just doesn’t work that way.
no quad store
We could simulate a quad store (to permit the SPARQL named graph construct) by adding a property to each node and arc. But again – there would need to be a layer added to translate this hack. Perhaps the SPARQL service built for Neo4j has provision for this.
The data store, and staging
Jena permits a ‘graph’ to be made up of bits stored in different directories on disk. For instance, in our service layer at present the AFD, APNI/APC, and CoL datasets are split into different files. As far as I can see, Neo4j simply doesn’t do this. Another thing that we can do in JENA is load the vocabulary files from RDF as static data. Neo4j would require them to be converted.
I’m not sure how we would both have an editor app that updates the tree and also have a SPAQL service running against that same data, although this is a problem in both Neo4j and Jena/Joseki. We could
- run the data store as a separate process on a separate port and communicate over http
- build the core tree manipulation operations as a library module in Neo4j or joseki (communicating via RMI, perhaps)
- run neo4j or joseki inside the tree webapp. Doing this probably means we lose all the clustering and management functionality.
Neo4j does do transactions, but it does them by maintaining state in-memory. I’m not 100% confident about that, but then again: I’m not sure how JENA doe them.
Cypher is kinda cute. Nodes have parentheses, arcs have arrows suspiciously like the syntax in Graphviz .dot files, and filtering criteria uniformly have square brackets. It has features which I can’t recall as being in SPARQL, that is: it may be better than SPARQL.
The main thing is as stated: it’s not a standard, and they are still working on it. To be confident your code will continue to work, you need to add a cypher version command at the top of the file.
As I said: I don’t like it, don’t trust it, but maybe I’m just a stick-in-the-mud. The main issue is the mismatch between this and RDF.