SKOS and classifications

April 29, 2015

“Oh, and we’d really like SKOS as well”.

Riiiight.

We have names and references.
Names appear in references as “instances” of a name.
Some name instances are taxonomic concepts.

We have a classification made up of tree nodes.
A tree node has many subnodes.
A tree node may be used by many supernodes as a subtree.
A tree node (for classification trees) refers to a taxonomic concept instance.

Now here’s the thing.

The basic job of SKOS in our data is to say “this concept is narrower-than that concept”. X ∈ A → X ∈ B. Naively, we’d put this over the node ids and say job done.

But our tree nodes are most certainly not taxonomic concepts. They are placements of concepts in some classification. It’s the concepts that are concepts. So the skos output must declare that it is the taxon concepts that are narrower-than other taxon concepts.

The problem is that the data becomes a mishmash. If a concept is moved from one place to another, then our classification data (which holds the entire history of a tree) will declare in one node that A⊂B and in another that A⊂C, where B and C are different taxa at the same level.

Not good.

The problem becomes clearer when you consider having two entirely separate and incompatible classifications. It makes sense to say A⊂B according to the people at NSW University, but A⊂C according to the Norwegian National Classification of Herring. These classifications are sources of assertions that you can choose to trust or to not trust.

The nature of our classification nodes becomes clear. Each node is a document – a source of triples – that asserts certain facts and that trusts the assertion of facts in the nodes below it.
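Sketched in TriG (all URIs invented for illustration), two classification nodes become two named graphs, each asserting its own version of the hierarchy:

```trig
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/classification/> .

# The NSW University classification node asserts one parent ...
ex:nsw-node-42 {
  ex:conceptA skos:broader ex:conceptB .
}

# ... while the Norwegian herring classification asserts another.
ex:norway-node-7 {
  ex:conceptA skos:broader ex:conceptC .
}
```

The two assertions no longer contradict each other, because each sits in its own graph: you choose which graph to trust.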

All I need to do is to find a suitable ontology for expressing this. The W3C provenance ontology looks promising. It even has – Oh My Lord! It has these predicates

prov:wasRevisionOf prov:invalidatedAtTime prov:wasInvalidatedBy

Which match exactly certain columns in my data.

(edit: why is this important? It indicates that the provenance ontology and my data may very well be talking about the same kinds of thing. If your data has a slot for “number of wheels” and “tonnage” and so does mine, there’s a good chance we are both discussing vehicles.)

I’m pretty comfortable, at this point, deciding that each tree node in a classification is a prov:Entity object. In fact, it seems to be a prov:Bundle in the sense of its subnodes, and a prov:Collection in the sense of the reified subnode axioms, which are skos terms.
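A sketch of what a tree node might look like as a prov:Entity – the node and activity URIs here are invented for illustration:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/tree/> .

# A superseded classification node, using the PROV predicates that
# line up with columns in my data.
ex:node-42 a prov:Entity ;
    prov:wasRevisionOf     ex:node-41 ;
    prov:invalidatedAtTime "2015-04-29T00:00:00Z"^^xsd:dateTime ;
    prov:wasInvalidatedBy  ex:move-operation-17 .
```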

I’m probably over-engineering things again. But there you go.


PS: not all subnode relationships are skos broader/narrower term relationships. In particular, hybrids. A hybrid cucumber/pumpkin is not simply a kind of cucumber in the way that a broadbean is a kind of legume. I cannot simply assert skos inclusion over a tree – the actual link types and node types matter.
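So the export has to look at link types. A Turtle sketch, with an entirely invented vocabulary:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/voc#> .

# An ordinary subnode link: safe to export as skos:broader.
ex:broadbean ex:parentNode ex:legumes .

# Hybrid links: must NOT be exported as skos:broader,
# because the hybrid is not simply a kind of either parent.
ex:cucumberPumpkinHybrid ex:hybridParent ex:cucumber , ex:pumpkin .
```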


JENA, D2R, TDB, Joseki – happy together at last

March 3, 2015

Well!

The magic formula turned out to be:

  1. Run joseki
  2. Using the d2rq libraries
  3. And just the two joseki jars containing the actual server
  4. Use a new version of TDB, which uses the new ARQ

And she’s up. Dayum. I have the static vocabulary files, the preloaded old APNI, AFD, and 2011 Catalogue of Life, and a live link to our (test) database, and you can write SPARQL queries over the lot. No worries. A trifle slow if you do stuff that spans the data sources.

Now the boss wants me to write some user doco for it. Item 1 is why this stuff matters (it matters because it’s our general query service). So rather than explaining it all here in my blog, I should do it on confluence.


Fish, jelly, and linked data.

February 11, 2015
You can’t nail jelly to a tree. But you can put it in a bucket and slap a label on the bucket.

Fish. Or, if you prefer, PISCES. http://biodiversity.org.au/name/PISCES, in fact.

PISCES (as xml) (as json)
URI http://biodiversity.org.au/afd.name/244465
LSID urn:lsid:biodiversity.org.au:afd.name:244465
modified 2013-10-17
Title PISCES
Complete PISCES
Nom. code Zoological
Uninomial PISCES
Taxonomic events
PISCES sensu AFD afd

Our new system has scientific names, with author, date, nomenclatural status – the whole thing. And, of course, these names all have URIs with id numbers. Which is ok from the POV of the semantic web, as URIs are opaque identifiers.

But, but wouldn’t it be nice if you actually could read the URI? Wouldn’t it be nice to just use “Megalops cyprinoides” or “Herring” as an id? I mean – isn’t that the whole point of naming things? The whole point of identifying things is that the name can be used as, well, an identifier. And indeed, it is used that way. All the time. When you go into the nursery and buy a rosebush, the label proudly displays the scientific name as well as what most people actually call it, and it doesn’t cite the author of the name. There’s no point saying “but without the author, it doesn’t mean anything”, because it quite plainly and obviously does mean something. To lots of people.

So we support it. Eg http://biodiversity.org.au/name/Herring.

The main people to whom a name without author doesn’t mean anything are those people engaged in the work of managing names. The information-science aspect of keeping a nice clean set of identifiers available for the rest of the world to use. Taxonomists, and people at the pointy end of the job of working out what actually lives on this planet of ours. They’re the only ones who really care.

Thing is, our databases are built around that work – that’s the mission here. Within that space a bare scientific name actually is pretty much meaningless. But given that we do support these URIs, from a semweb perspective what does http://biodiversity.org.au/name/PISCES, or for that matter http://biodiversity.org.au/name/Herring mean, exactly? What is it the URI of? What, in a word, is its referent?

I think that the referent of http://biodiversity.org.au/name/Herring has got to be the word “Herring”. That’s all. Perhaps it’s even owl:sameAs text:Herring – if ‘text’ were a URI scheme.

Our database does not have anything in it whose identifier is simply ‘Herring’. However, we do have a bunch of other stuff that might be of interest. In particular, we have a name whose URI is http://biodiversity.org.au/afd.name/246141. If you ask for Herring, which is an id that is not in our database, you will get that.

Is this a legit thing to do? The URI http://biodiversity.org.au/name/Herring is kind of in a limbo. On the one hand, the URI exists – you don’t get a 404. On the other hand, we don’t serve up any data pertaining to the URI that you have referenced. You ask for ‘Herring’. You get something else that isn’t Herring (one of its fields is the text value ‘Herring’, but what does that mean?). Now, you and I understand this, but a semantic web engine isn’t going to. To put it another way, over the entire graph hosted at our server, http://biodiversity.org.au/name/Herring sits on its own and isn’t connected to anything.

So. To address this, my plan is to create an object whose URI is http://biodiversity.org.au/name/Herring. It will have a type of – ooh, I dunno – boa-name:SimpleName, I’ll give it an rdf:value of ‘Herring’ and a number of predicates named – say – boa-name:isSimpleNameOf (or probably just do it in reverse: boa-name:hasSimpleName on the name object). There you go.
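In Turtle, the plan would look something like this (the boa-name namespace URI and class/predicate names are my proposed, not-yet-real ones; only the afd.name URI exists today):

```turtle
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix boa-name: <http://biodiversity.org.au/voc/boa/Name#> .

# The 'simple name' object that anchors the convenience URI.
<http://biodiversity.org.au/name/Herring>
    a boa-name:SimpleName ;
    rdf:value "Herring" .

# The real name object points at it.
<http://biodiversity.org.au/afd.name/246141>
    boa-name:hasSimpleName <http://biodiversity.org.au/name/Herring> .
```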

The situation is slightly different for http://biodiversity.org.au/taxon/Herring. The meaning of this is rather more specific. This URI is an alternative name – owl:sameAs – for the current APC or AFD concept for the valid or vernacular name whose simple name string is ‘Herring’. It’s the accepted taxon for the name at b.o.a.

That is, there is not a separate object. It’s an alternative URI for an object that we already host. And it may change over time: that’s the nature of semantic web data. Actually implementing this in d2r … I don’t know. I might need to build a few database views for the various ways this fact might be derived.
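A sketch, with the target URI invented for illustration:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# The taxon URI is just another name for an object we already host.
<http://biodiversity.org.au/taxon/Herring>
    owl:sameAs <http://biodiversity.org.au/afd.taxon/123456> .
```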

At least that’s the plan. If there are outright collisions of simple names in accepted taxa among our different domains, then, well – this will need to be re-thought a bit. It may very well be that http://biodiversity.org.au/taxon/Herring is not usable as an id, but that http://biodiversity.org.au/taxon/animalia/Herring might be.

In any event. The goal is to give these convenience URIs a home and maybe even make them mean something.


Neo4j – Don’t like it. I’ll try to explain why.

February 24, 2014

Having looked through the neo4j manual, I am not convinced that it is a good fit for what we are trying to accomplish.

  • The underlying model does not fit well into RDF.
    • It is not a triple store.
    • It is not based around URIs
    • It does not support SPARQL out of the box – it needs add-ons
  • It does not appear to support separate disk files for partitioning data
  • Cypher (the neo4j query language) is not a standard in the same way that SPARQL is
  • Cypher is still being developed (although there is a mechanism for backward compatibility)

These problems can all be addressed, but they will require add-ons and work-arounds to do so.

The underlying model

Neo4j stores a graph of nodes and arcs. The nodes and arcs can be decorated with what neo4j calls ‘labels’ and ‘properties’.

Labels serve much the same purpose as RDF and OWL classes and predicates. A node may have any number of labels, each arc may have one. One of the main points about them is that one may create indexes (I get the impression that this is actually a legacy feature) on a label:property pair. You can index the ‘name’ property of every node with a ‘Taxon’ label. These indexes can be declared unique, which gives you a degree of data integrity checking (although with nothing like the rigour of an RDBMS).

Properties are simply primitive values – numbers, strings, etc. ‘Data properties’ in OWL-speak.

Problems are:

Property and label names are plain strings

Although labels and property names can be URIs, the Cypher language does not support this beyond allowing you to quote these kinds of identifiers (with back-quotes, of all things). It’s missing the ability to declare namespaces to be used as prefixes so far as I can see.

This means that either we put
`http://purl.org/dc/terms/title`
all over the shop in the queries, or we bolt something over the top of it to supply the missing prefixes when we convert it to RDF. Or we don’t use dublin core.

Neo4j permits properties to be placed on arcs

While this is basically a great idea, it doesn’t translate into RDF. The way to do this in RDF would be to generate an rdf:Statement object for each arc, and to attach the properties to that. This means that we require a translation layer (unless the bolt-ons on the web site do something like that).
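For the record, the standard reification pattern looks like this (URIs invented):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

# One rdf:Statement per arc, so that the arc's data has somewhere to live.
ex:arc1 a rdf:Statement ;
    rdf:subject   ex:nodeA ;
    rdf:predicate ex:subnodeOf ;
    rdf:object    ex:nodeB ;
    ex:addedOn    "2014-02-24" .   # the property that sat on the Neo4j arc
```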

A problem is that we would want to do this a lot – one of the things we need to do is to attach data to the arcs. Really, it’s a deficiency with the RDF model itself, but if we want to produce RDF at all then the question of ‘how do we present this data we have put on the arcs’ becomes a thing.

Another issue is that properties are only ever data, not arcs in themselves. One of the things we may want to do is to use a controlled vocabulary for certain properties. Enumerated types. The way we normally do this is to declare a set of URIs. We can certainly put these in strings as properties on arcs, but they wouldn’t link to nodes in the same way. In RDF, a URI is simply a URI. In SPARQL you can query for ‘nodes having a persistence type that is a shade of blue’, because ‘persistence type’ and ‘colour shade’ are nodes in their own right. But if we want arcs to have a ‘persistence type’,  Neo4j just doesn’t work that way.
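This is the sort of query that falls out for free when the enumerated values are URIs – here with an entirely hypothetical vocabulary:

```sparql
PREFIX ex: <http://example.org/voc#>

# 'Nodes having a persistence type that is a shade of blue':
# we can traverse *through* the controlled term, because it is a node.
SELECT ?node
WHERE {
  ?node  ex:persistenceType ?ptype .
  ?ptype ex:colourShade     ex:Blue .
}
```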

No quad store

We could simulate a quad store (to permit the SPARQL named graph construct) by adding a property to each node and arc. But again – there would need to be a layer added to translate this hack. Perhaps the SPARQL service built for Neo4j has provision for this.

The data store, and staging

Jena permits a ‘graph’ to be made up of bits stored in different directories on disk. For instance, in our service layer at present the AFD, APNI/APC, and CoL datasets are split into different files. As far as I can see, Neo4j simply doesn’t do this. Another thing that we can do in JENA is load the vocabulary files from RDF as static data. Neo4j would require them to be converted.

I’m not sure how we would both have an editor app that updates the tree and also have a SPARQL service running against that same data, although this is a problem in both Neo4j and Jena/Joseki. We could

  • run the data store as a separate process on a separate port and communicate over http
  • build the core tree manipulation operations as a library module in Neo4j or joseki (communicating via RMI, perhaps)
  • run neo4j or joseki inside the tree webapp. Doing this probably means we lose all the clustering and management functionality.

Neo4j does do transactions, but it does them by maintaining state in-memory. I’m not 100% confident about that, but then again: I’m not sure how JENA does them.

Cypher

Cypher is kinda cute. Nodes have parentheses, arcs have arrows suspiciously like the syntax in Graphviz .dot files, and filtering criteria uniformly have square brackets. It has features which I can’t recall being in SPARQL – that is, it may be better than SPARQL.

The main thing is as stated: it’s not a standard, and they are still working on it. To be confident your code will continue to work, you need to add a cypher version command at the top of the file.

In Conclusion

As I said: I don’t like it, don’t trust it, but maybe I’m just a stick-in-the-mud. The main issue is the mismatch between this and RDF.


Federating data with JENA – Getting JENA going locally

July 29, 2012

Ok! First step is to get JENA/Joseki up and running. It seems that I am out of date – the current product is “Fuseki”. But Joseki works, and I do not currently need the new features in Fuseki.

Download site is here.

Unpacking joseki (after downloading from the browser)
pmurray@Paul3:~$ mkdir SPARQL_DEMO
pmurray@Paul3:~$ cd SPARQL_DEMO/
pmurray@Paul3:~/SPARQL_DEMO$ unzip ~/Downloads/joseki-3.4.4.zip 
pmurray@Paul3:~/SPARQL_DEMO$ ls
Joseki-3.4.4

Ok! I am going to build a config file with most of the gear ripped out, and I will provide a static RDF file with a bit of sample data.

sample.rdf
<?xml version="1.0"?>

<!DOCTYPE rdf:RDF [
    <!ENTITY sample-ontology "urn:local:sample-ontology:" >
    <!ENTITY colour "urn:local:sample-ontology:colour:" >
    <!ENTITY thing "urn:local:sample-ontology:thing:" >
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
]>

<rdf:RDF 
    xmlns="urn:local:sample-ontology:"
     xml:base="urn:local:sample-ontology"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:sample-ontology="urn:local:sample-ontology:"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

    <owl:Ontology rdf:about=""/>
    
    <owl:Class rdf:about="&sample-ontology;Colour"/>
    <owl:Class rdf:about="&sample-ontology;ColouredThing"/>

    <owl:ObjectProperty rdf:about="&sample-ontology;hasColour">
        <rdfs:range rdf:resource="&sample-ontology;Colour"/>
        <rdfs:domain rdf:resource="&sample-ontology;ColouredThing"/>
    </owl:ObjectProperty>
    
    <Colour rdf:about="&colour;RED"/>
    <Colour rdf:about="&colour;ORANGE"/>
    <Colour rdf:about="&colour;YELLOW"/>
    <Colour rdf:about="&colour;GREEN"/>
    <Colour rdf:about="&colour;BLUE"/>
    <Colour rdf:about="&colour;INDIGO"/>
    <Colour rdf:about="&colour;PURPLE"/>

    <ColouredThing rdf:about="&thing;GREENBALL">
        <hasColour rdf:resource="&colour;GREEN"/>
    </ColouredThing>

    <ColouredThing rdf:about="&thing;REDBALL">
        <hasColour rdf:resource="&colour;RED"/>
    </ColouredThing>
    
</rdf:RDF>

Ok! And we need a very, very basic config file. It’s a bit sad that this counts as “basic”, but there’s not a lot of way around it:

joseki_config.ttl
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

@prefix module: <http://joseki.org/2003/06/module#> .
@prefix joseki: <http://joseki.org/2005/06/configuration#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .

@prefix : <urn:local:joseki:config:> .

@prefix graph: <urn:local:graph:> .

[]
  rdf:type joseki:Server;
  joseki:serverDebug "true".

ja:MemoryModel rdfs:subClassOf ja:Model .
ja:UnionModel rdfs:subClassOf ja:Model .

:sample_vocabulary 
  a ja:MemoryModel ;
  ja:content [
    ja:externalContent <file:sample.rdf> 
  ] .

:empty_graph a ja:MemoryModel .

:dataset a ja:RDFDataset ;
  ja:defaultGraph :empty_graph ;
  ja:namedGraph [ 
    ja:graphName graph:sample ; 
    ja:graph :sample_vocabulary  
  ] .

:sparql_service
  rdf:type joseki:Service ;
  rdfs:label "SPARQL-SDB";
  joseki:serviceRef "sparql/";
  joseki:dataset :dataset;
  joseki:processor [
    rdfs:label "SPARQL processor" ;
    rdf:type joseki:Processor ;
    module:implementation [  
      rdf:type joseki:ServiceImpl;
      module:className <java:org.joseki.processors.SPARQL>
    ] ;
    joseki:allowExplicitDataset "false"^^xsd:boolean ;
    joseki:allowWebLoading "false"^^xsd:boolean ;
    joseki:lockingPolicy  joseki:lockingPolicyMRSW
  ] .

Great! Now we need to actually start the server with the config file that we have provided:

joseki.sh
#!/bin/bash
DD=$(pwd)
export JOSEKIROOT=$DD/Joseki-3.4.4
pushd $JOSEKIROOT
$JOSEKIROOT/bin/rdfserver --port 8081 $DD/joseki_config.ttl
popd

Do please note that the joseki service needs to be running to make the urls work. I mention it in the spirit of “please check that your computer is plugged in”.

Starting the sparql server
pmurray@Paul3:~/SPARQL_DEMO$ ./joseki.sh

And the server starts perfectly fine. At this point, I should be able to navigate to http://localhost:8081/sparql/ (note the slash at the end).

It works fine – joseki correctly complains that I have not given it a query string. So let’s give it one!

http://localhost:8081/sparql/?output=text&query=select * where { graph ?g { ?s ?p ?o } }

Now I want a better web interface than typing SPARQL into a command line, so I will use this from now on:

sparql.html
<html>
  <body>
      <form action="http://localhost:8081/sparql/" method="post" target="SPARQLOUTPUT">
	  <textarea style="background-color: #F0F0F0;" name="query" cols="70" rows="27">
select ?g ?s ?p ?o
where { 
  graph ?g { 
    ?s ?p ?o
  }
}
ORDER BY ?g ?s ?p ?o
          </textarea>
	  <br>
	  <input type="radio" name="output" value="xml"> xml, 
	  <input type="radio" name="output" value="json"> json,
	  <input type="radio" name="output" value="text"> text,
	  <input type="radio" name="output" value="csv"> csv,
	  <input type="radio" name="output" value="tsv" checked> tsv<br>
          Force <tt>text/plain</tt>: <input type="checkbox" name="force-accept" value="text/plain"><br>
	  <input type="submit" value="Get Results" >
      </form>
  </body>
</html>

And that does the job. Click “force plain” to stop your browser from downloading the output as a file.



Federating data with JENA

July 29, 2012

I am going to attempt here to bring it all together and make some magic happen with SPARQL and RDF. My goal is to run a local and largely blank instance of JENA which fetches data from heterogeneous data sources, and applies reasoning rules over the top.

The goal is to demonstrate that rdf can be useful even without global, worldwide agreement on vocabulary and ontology. The key to making this work is not getting everyone to agree on terms and what they mean by them, but to get everyone to clearly state what terms they use and what they mean by them. Hopefully, the subject matter itself means that the meanings are pretty much compatible.

Speaking of meanings: before I continue, I’d like to apologise in advance for my inevitable solecisms. I’m a computing person, not a biologist or taxonomist.

Step 1: Getting JENA going locally

Step 2: Linking the local JENA to more than one external SPARQL service

Step 3: Using OWL to translate the foreign data into a common local vocabulary

Step 4: running a query.


SPARQL

February 16, 2012

Success!

Rather than use an oracle database as the back-end store for our sparql service, I am using TDB: a back end that comes with JENA. It’s just some files in a local directory.

This means that the boa content has to go on the web server, but that’s ok. It’s about 42 gig.

And my God it’s fast. Even checking a regex against all names is … tolerable. Several seconds.

Must add search fields for the names and authors – converted to uppercase, all diacritics removed. Perhaps even add taxamatch conversion.
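Precomputed search fields matter because the portable alternative is a string-function FILTER, which the engine evaluates row by row rather than via the index. A sketch using the tn: prefix from my other posts:

```sparql
PREFIX tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>

# Case-insensitive match via UCASE (SPARQL 1.1) – works, but forces a
# per-row function evaluation instead of a hash lookup.
SELECT ?name ?complete
WHERE {
  ?name tn:nameComplete ?complete .
  FILTER( UCASE(STR(?complete)) = "ABACOPTERIS ASPERA" )
}
```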

Also need to write some demo pages and host them on biodiversity.org.au. My demo pages use AJAX, which creates cross-site scripting issues if they are not hosted there.


Sparql at biodiversity.org.au – part 3

December 9, 2011

Well, I have uploaded the sample HTML page to NSL_SPARQL.html at google code.

You need to save it as an HTML file, not as a .txt file.

This page contains some javascript that presents a few more options than the simple form in my other posts. In particular, it will convert your query into a hyperlink. The “view as table” option is also convenient, as it does not switch you away to another window.

The error handling – could be better. But this html form is the tool that I have been using to explore sparql.


Sparql at biodiversity.org.au, pt 2

December 8, 2011

When last we spoke – this morning, in fact – I was having trouble getting a simple string match going.

After some investigation – our DBA asking Oracle to dob on what joseki is doing – it’s… damn weird.

We have different SQL being generated for the FILTER query and the =”…” query. The filter one, as I’d expect, does a select * from nodes. It’s surprising that it’s as fast as it is. But so does the =”…” one.

Joseki seems to generate all the right SQL, but it’s commented out, and all that’s left uncommented is “select * from nodes” … no – I was totally wrong about that. Disregard everything I just said about Joseki not converting queries into SQL. The SQL is good, I was just not seeing the line breaks at the end of the commented out bits.

The hash that joseki is generating for the string “Abacopteris aspera” does not match the hash for that string in the database. We are using SDB with index2, and that means that each distinct value is hashed and the hash indexed – that’s how it deals with different data types.

The bit that seems to matter from the query is

-- Const: <http://biodiversity.org.au/voc/graph/DATASET#20111129-APNI_TAX_NAM>
INNER JOIN Nodes N_2
-- Const: <http://rs.tdwg.org/ontology/voc/TaxonName#nameComplete>
ON ( N_1.hash = -707528504822182151 AND N_2.hash = 696684872646940002 ) INNER JOIN Nodes N_3
-- Const: "Abacopteris aspera"
ON ( N_3.hash = -6114722394035499839 )
INNER JOIN Quads Q_1
-- <http://biodiversity.org.au/voc/graph/DATASET#20111129-APNI_TAX_NAM> :2s <http://rs.tdwg.org/ontology/voc/TaxonName#nameComplete> "Abacopteris aspera"
ON ( Q_1.g = N_1.id
-- Const condition: <http://biodiversity.org.au/voc/graph/DATASET#20111129-APNI_TAX_NAM>
AND Q_1.p = N_2.id
-- Const condition: <http://rs.tdwg.org/ontology/voc/TaxonName#nameComplete>
AND Q_1.o = N_3.id
-- Const condition: "Abacopteris aspera" )
LEFT OUTER JOIN Nodes R_1
-- Var: :3s
ON ( Q_1.s = R_1.id )

Now, pulling out all the hash values from that and querying against the oracle data tables:

HASH TO_CHAR(LEX)
-707528504822182151 http://biodiversity.org.au/voc/graph/DATASET#20111129-APNI_TAX_NAM
696684872646940002 http://rs.tdwg.org/ontology/voc/TaxonName#nameComplete

As you see, the hash values for the URIs are correctly computed. But the hash value for the string – according to the value in the data table, it should be 6576901907426019494, which is nowhere to be seen.

Hmm. What’s that in hex, I wonder? Hash in the database: 5B45D9705AB788A6, hash in the query: AB2424613B634CC1. Nope – no luck there. Nothing to do with each other.

So: why is the query engine computing a different hash value for a constant string than the SDB loader generated when it loaded it?

I hacked up Joseki by recompiling one of the classes and adding debugging. There’s a method Nodelayout2.hash(String lex, String lang, String datatype, int type). It gives me the SDB hash when passed
Abacopteris aspera, null, http://www.w3.org/2001/XMLSchema#string, 4
and the JOSEKI hash when passed
Abacopteris aspera, null, null, 3

So I’m guessing that type 3 is “untyped literal” and type 4 is “typed literal”. …

Ok. Types are in Enumeration ValueType. 3 and 4 are STRING and XSDSTRING, respectively, which makes perfect sense.

Can I get JOSEKI to convert my literal into an XSDSTRING?

select ?lbl ?pred ?value
where { 
  graph g:APNI_TAX_NAM { 
    <http://biodiversity.org.au/apni.name/277356>
     tn:nameComplete 
     "Abacopteris aspera"^^<http://www.w3.org/2001/XMLSchema#string> .
  }
}

Drat. No. But here’s the intriguing thing …

Ah ha! It’s not intriguing at all! I’m an idiot! I just wasted half a day puzzled over this! I do indeed get one row back, but because none of the variables are bound, my results page shows a table row that’s only a couple of pixels high! If I hadn’t coloured the rows, I’d have seen nothing at all!

Well … that’s awesome. It means that this should work:

select ?taxNamUri
where { 
  graph g:APNI_TAX_NAM { 
    ?taxNamUri
     tn:nameComplete 
     "Abacopteris aspera"^^<http://www.w3.org/2001/XMLSchema#string> .
  }
}

And not only does it work, it comes back really fast. Hmm. Now, that’s running against an instance of joseki running on my machine, which I have hacked up for the occasion. What about running it against the one at BOA? …

Oh my Lord! It’s awesome! Let’s try using a prefix for the xml schema namespace. …

Yep, that’s good too. Now then: let’s combine a few different names into a single subgraph. This is important, because I am aiming at being able to submit a list of names:

select ?taxNamUri
where { 
  graph g:APNI_TAX_NAM { 
    {?taxNamUri tn:nameComplete "Abacopteris aspera"^^xs:string .}
    UNION
    {?taxNamUri tn:nameComplete "Abacopteris presliana"^^xs:string .}
    UNION
    {?taxNamUri tn:nameComplete "Abacopteris triphylla"^^xs:string .}
  }
}

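If the endpoint supports SPARQL 1.1 (that's an assumption – this Joseki install may not), a VALUES block expresses the same list of names less verbosely than chained UNIONs:

```sparql
select ?taxNamUri ?name
where {
  graph g:APNI_TAX_NAM {
    # One row per candidate name, bound into ?name.
    VALUES ?name {
      "Abacopteris aspera"^^xs:string
      "Abacopteris presliana"^^xs:string
      "Abacopteris triphylla"^^xs:string
    }
    ?taxNamUri tn:nameComplete ?name .
  }
}
```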
And finally, I should be able to hook that up to my “branch” graph (it’s a long story) to get the accepted taxon, and the “taxon” graph to get the full title of that taxon.

select ?taxNamUri ?name  ?acceptedTax ?accTaxTitle
where { 
  graph g:APNI_TAX_NAM { 
    {
      {?taxNamUri tn:nameComplete "Abacopteris aspera"^^xs:string .}
      UNION
      {?taxNamUri tn:nameComplete "Abacopteris presliana"^^xs:string .}
      UNION
      {?taxNamUri tn:nameComplete "Abacopteris triphylla"^^xs:string .}
    }
    ?taxNamUri tn:nameComplete ?name
  }
  OPTIONAL {
    graph g:APC_TREE {
      ?acceptedTax ibis:isAcceptedConceptFor ?taxNamUri .
    }
    graph g:APNI_TAX_CON {
      ?acceptedTax dcterms:title ?accTaxTitle
    }
  }
}

(I’ll get rid of the URI columns)

?name                   ?accTaxTitle
Abacopteris aspera      Pronephrium asperum (C.Presl) Holttum [CHAH 2006]
Abacopteris presliana
Abacopteris triphylla   Pronephrium triphyllum (Sw.) Holttum sensu Bostock, P.D. (1998)

And that’s it. Well, the SPARQL, anyway. I just need to write a little bit of HTML and javascript allowing a user to paste a list of names into an HTML form, and done.


Sparql at biodiversity.org.au

December 8, 2011

Well! Quite a bit of success with our test deployment of a SPARQL server. The server runs at http://biodiversity.org.au/sparql/. I have a very nice html page that uses the JSON output, but WordPress won’t let me upload it. Oh well. And, of course, our wiki is locked up.

Oh well. Here’s code for a simple HTML form. WordPress clips it, but that’s just display: you can still c/p it into a html file.

<html>
  <body>
    <form action="http://biodiversity.org.au/sparql/" 
        method="post" target="SPARQLOUTPUT">
      <textarea style="background-color: #F0F0F0;" 
          name="query" cols="70" rows="30">
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
prefix tc: <http://rs.tdwg.org/ontology/voc/TaxonConcept#>
prefix pc: <http://rs.tdwg.org/ontology/voc/PublicationCitation#>
prefix tcomm: <http://rs.tdwg.org/ontology/voc/Common#>
prefix ibis: <http://biodiversity.org.au/voc/ibis/IBIS#>
prefix afd: <http://biodiversity.org.au/voc/afd/AFD#>
prefix apni: <http://biodiversity.org.au/voc/apni/APNI#>
prefix apc: <http://biodiversity.org.au/voc/apc/APC#>
prefix afdp: <http://biodiversity.org.au/voc/afd/profile#>
prefix apnip: <http://biodiversity.org.au/voc/apni/profile#>
prefix g: <http://biodiversity.org.au/voc/graph/GRAPH#>

select ?label ?title ?desc
  where {
    graph g:meta {
      ?uri rdf:type g:GraphURI .
      OPTIONAL { ?uri rdfs:label ?label  } .
      OPTIONAL { ?uri dcterms:title ?title  } .
      OPTIONAL { ?uri dcterms:description ?desc  } .
    }
  }
ORDER BY ?uri
      </textarea>
      <br>
      <input type="radio" name="output" value="xml"> xml,
      <input type="radio" name="output" value="json"> json,
      <input type="radio" name="output" value="text"> text,
      <input type="radio" name="output" value="csv"> csv,
      <input type="radio" name="output" value="tsv" checked> tsv<br>
      Force <tt>text/plain</tt>: <input 
          type="checkbox" name="force-accept"   value="text/plain"><br>
      <input type="submit" value="Get Results" >
    </form>
  </body>
</html>

The form contains a little sample query.

A big problem is metadata, which involves question like

  • What named graphs does the sparql service expose?
  • What vocabularies are used?
  • What publically-visible identifiers/top-level objects are available?

I’ve made a bit of an attempt at making this self-documenting by having the “meta” and “ibis_voc” graphs, containing the graphs and the vocabulary. But it’s hard going interpreting OWL, which is what the vocabulary documents are, and the rdfs:comment entries for the local vocabularies are not always well-written. Sigh: no matter how clever you try to be with tools and structure, ultimately you have to sit down and write the content.

So: What classes and predicates are defined in our custom ibis vocabulary – in addition to the TDWG standards? (my sample code assumes that you have left the prefix declarations as they are in the sample HTML form)

select ?pred 
where { 
  graph g:ibis_local_voc { 
    ?pred rdf:type owl:ObjectProperty .
  }
}
ORDER BY ?pred

Rad. Do any of them have domains and ranges defined?

select ?domain ?pred ?range 
where { 
  graph g:ibis_local_voc { 
    ?pred rdf:type owl:ObjectProperty .
    OPTIONAL { ?pred rdfs:domain ?domain } .
    OPTIONAL { ?pred rdfs:range ?range } .
  }
}
ORDER BY ?domain ?pred

String[] words = {"Booyah", "Awesome", "Boss", "Amazing"};
System.out.println(words[new Random().nextInt(4)] + "!");

Ok. But what about the data? Well, our identifiers are biodiversity.org.au URIs (a rather important nugget of info, that), in order to “play nice” with the semantic web.

select ?pred ?value
where { 
  graph g:APNI_TAX_NAM { 
    <http://biodiversity.org.au/apni.name/277356>
        ?pred ?value .
  }
}
ORDER BY ?pred

You know, I’d really like to see the rdfs:labels for those predicates rather than the URLs.

select ?lbl ?value
where { 
  graph g:APNI_TAX_NAM { 
    <http://biodiversity.org.au/apni.name/277356>
        ?pred ?value .
  }
  OPTIONAL {
    graph g:ibis_voc {
      ?pred rdfs:label ?lbl
    }
  }
}
ORDER BY ?pred

And we see that I haven’t done rdfs labels for some of them.

Of course, the really big thing is to search by name. And here we hit a snag. Let’s get specific about http://biodiversity.org.au/apni.name/277356 .

select ?lbl ?pred ?value
where { 
  graph g:APNI_TAX_NAM { 
    <http://biodiversity.org.au/apni.name/277356>
     tn:nameComplete 
     ?value .
  }
}

As you see, it definitely has a name and that name is Abacopteris aspera. So that means this should work. It should pull out a single row with none of the variables bound:

select ?lbl ?pred ?value
where { 
  graph g:APNI_TAX_NAM { 
    <http://biodiversity.org.au/apni.name/277356>
     tn:nameComplete 
     "Abacopteris aspera" .
  }
}

But no – no rows. This, however:

select ?nameComplete
where { 
  graph g:APNI_TAX_NAM { 
    <http://biodiversity.org.au/apni.name/277356>
     tn:nameComplete ?nameComplete .
    FILTER(?nameComplete = "Abacopteris aspera") .
  }
}

Works fine. So … to find the URIs with a name complete of “Abacopteris aspera”, we just use a filter, right? Not so fast! It takes a while to run. It’s pretty obvious that it’s not hitting the index.

(Continued …)