New features

January 16, 2013

The thing is, the users don’t care about the same things that you care about.

I just added a feature to AFD that lists – for a publication – the places where that publication is used. Its references, what names those references are used in, its child publications.

For me, this is “no worries” material. It’s just a static page. But for the users it’s miraculous. Every time someone sends mail saying “this publication is cited wrongly”, it can’t be fixed until the team confirms that it isn’t used somewhere where the way it is cited is correct for that usage. Previously, this involved database queries done by their tech person and typing in the ids into a browser window. Now it’s just clickety-click. Going to save them hours and hours of time, it seems.

Of more interest to me is my algorithm for managing duplicate publications. The users were very keen that if you mark a record as being a duplicate of some other record, you be able to undo it. For my part, I was keen that I not have to rewrite every query in the system to accommodate this (the publication_id is X, but X had been replaced by Y so we use that).

To manage this, I added a field “original publication_id” to the tables. When a publication is marked as a duplicate, the publication_id pointers to it are updated to point at the record it duplicates, while “original publication_id” keeps the old value. You can make chains and trees of duplicates if you want. To undo, we find all publications that were merged into the one we are unmarking (a small tree walk using a recursive WITH query), and fix up anything whose “original publication_id” is in that set.

It works a treat. Simple, fast, doesn’t disturb the rest of the system – all that good stuff.

Nearly finished fixing things in AFD, I think. Main important thing remaining is fixing “replaced by” taxa where a record is created from another record which is never made public – it breaks the chain of provenance, because we never expose that middle link.


Publication Citations and open standards

January 16, 2013

I suppose this post is just to let that minuscule number of people who follow this blog know about the later add-on to the Australian Faunal Directory. It’s invisible, you see, unless you have a browser plug-in.

I jammed OpenURLs in after all of the publication citations. These are present as COinS spans, which are sensed by a Firefox browser plugin.
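For anyone who hasn't seen one, a COinS span is an empty span with class Z3988 whose title attribute carries a URL-encoded OpenURL context object. A rough sketch of building one in Python follows. The function name and the sample citation values are made up for illustration, and this is not how AFD generates them.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(fields, fmt="journal"):
    # Standard preamble: the OpenURL version and the metadata format in use.
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", f"info:ofi/fmt:kev:mtx:{fmt}"),
    ]
    # Referent fields go in with an "rft." prefix.
    pairs += [(f"rft.{key}", value) for key, value in fields.items()]
    # URL-encode the key/value pairs, then HTML-escape the result for the attribute.
    title = escape(urlencode(pairs), quote=True)
    return f'<span class="Z3988" title="{title}"></span>'


# Placeholder citation, not a real reference.
print(coins_span({
    "atitle": "An example title",
    "jtitle": "Example Journal",
    "date": "2013",
}))
```

Note the double layer of escaping: the ampersands between the pairs have to become &amp;amp; once the string lands inside an HTML attribute.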

I suppose the amazing thing is that they work pretty much as advertised. When you view an AFD page with the plug-in active, you can click the icon (which is configurable, thank God) and be taken to the WorldCat aggregator, which will cheerfully list the libraries closest to you that list the publication. So, from the AFD page to looking at a physical copy, it’s only a few clicks away. The CSIRO libraries and the various libraries at Australian universities feed data to WorldCat.

I’d like to get feedback from a Mendeley user. I came across this:

As one of the many requested features from our feedback page, Mendeley Web now supports COinS (ContextObjects in Spans). COinS is an open and easy to use specification for publishing OpenURL bibliographic metadata in HTML. On web pages, embedded COinS can be read and processed by applications.

Also Mendeley’s Web Importer can now identify COinS embedded on other websites. This information can then be easily imported into your Mendeley research library. In addition, a COinS section is embedded in each of our article pages. This means that other bookmarklets can extract and process the information from Mendeley’s article pages.

I imagine that this means that you can now point Mendeley at, say, Hydroptila acinacis and it will pull out the references. It’d be nice to hear if it works.

Actually – I’ll twitter it.


Warm and Fuzzy.

October 13, 2012

A while back, I wrote a utility to pull in an XML schema and spit out a graphviz “.dot” file. I got permission from the boss to put it on sourceforge and in the public domain.

Just recently, I got some mail from someone wanting to use it. After a message back and forth, today I got this:

What an hoped for, but unexpected pleasure to hear from you!!

When I sent the original e-mail I had next to 0 hope of hearing from you. e-mail addresses die, people lose interest, what not. So to see your replies is a really, really nice way to start my day.

I’ve ended up hacking Grapher.java and adding the URIs that I wanted to ignore. I tried using ant and it wanted to source zip files from the Internet that weren’t there any more, so I did it the brute force way. I found jars and zips that would satisfy the compiler, updated a Makefile and iterated in that loop for a while. Finally got the thing to compile and execute. Life is good.

I understand your desire when you created the program. It is much easier for me to grasp something if I can see a picture of whatever it is. So I wanted to see how the ~20 xsd files that I had to document were related. I wanted a picture. Your software answered my needs.

After changing Graph.java to exclude the things I wasn’t interested in (because they were outside our control), I was able to use the dot images to point out specific areas where the xsd files could be improved and clarified. There were things like circular includes, unused nodes, incorrectly annotated nodes and miscellaneous other things. Nothing that would prevent the xsds from working, but things that can drive people crazy.

Your software/tool/visualizer solved a presentation and interpretation problem for me.

Thank you for creating it and putting it in the public domain.

Damn!

I blackmailed him just now – asking him if he could add the things that he needed to do to compile it back into the sourceforge project :) .

Good stuff. Kudos to the graphviz project, which does the hard part of the job.


Federating data with JENA – Getting JENA going locally

July 29, 2012

Ok! First step is to get JENA/Joseki up and running. It seems that I am out of date – the current product is “Fuseki”. But Joseki works, and I do not currently need the new features in Fuseki.

Download site is here.

Unpacking joseki (after downloading from the browser)
pmurray@Paul3:~$ mkdir SPARQL_DEMO
pmurray@Paul3:~$ cd SPARQL_DEMO/
pmurray@Paul3:~/SPARQL_DEMO$ unzip ~/Downloads/joseki-3.4.4.zip 
pmurray@Paul3:~/SPARQL_DEMO$ ls
Joseki-3.4.4

Ok! I am going to build a config file with most of the gear ripped out, and I will provide a static RDF file with a bit of sample data.

sample.rdf
<?xml version="1.0"?>

<!DOCTYPE rdf:RDF [
    <!ENTITY sample-ontology "urn:local:sample-ontology:" >
    <!ENTITY colour "urn:local:sample-ontology:colour:" >
    <!ENTITY thing "urn:local:sample-ontology:thing:" >
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
]>

<rdf:RDF 
    xmlns="urn:local:sample-ontology:"
     xml:base="urn:local:sample-ontology"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:sample-ontology="urn:local:sample-ontology:"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

    <owl:Ontology rdf:about=""/>
    
    <owl:Class rdf:about="&sample-ontology;Colour"/>
    <owl:Class rdf:about="&sample-ontology;ColouredThing"/>

    <owl:ObjectProperty rdf:about="&sample-ontology;hasColour">
        <rdfs:range rdf:resource="&sample-ontology;Colour"/>
        <rdfs:domain rdf:resource="&sample-ontology;ColouredThing"/>
    </owl:ObjectProperty>
    
    <Colour rdf:about="&colour;RED"/>
    <Colour rdf:about="&colour;ORANGE"/>
    <Colour rdf:about="&colour;YELLOW"/>
    <Colour rdf:about="&colour;GREEN"/>
    <Colour rdf:about="&colour;BLUE"/>
    <Colour rdf:about="&colour;INDIGO"/>
    <Colour rdf:about="&colour;PURPLE"/>

    <ColouredThing rdf:about="&thing;GREENBALL">
        <hasColour rdf:resource="&colour;GREEN"/>
    </ColouredThing>

    <ColouredThing rdf:about="&thing;REDBALL">
        <hasColour rdf:resource="&colour;RED"/>
    </ColouredThing>
    
</rdf:RDF>

Ok! And we need a very, very basic config file. It’s a bit sad that this counts as “basic”, but there’s not a lot of way around it:

joseki_config.ttl
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

@prefix module: <http://joseki.org/2003/06/module#> .
@prefix joseki: <http://joseki.org/2005/06/configuration#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .

@prefix : <urn:local:joseki:config:> .

@prefix graph: <urn:local:graph:> .

[]
  rdf:type joseki:Server;
  joseki:serverDebug "true".

ja:MemoryModel rdfs:subClassOf ja:Model .
ja:UnionModel rdfs:subClassOf ja:Model .

:sample_vocabulary 
  a ja:MemoryModel ;
  ja:content [
    ja:externalContent <file:sample.rdf> 
  ] .

:empty_graph a ja:MemoryModel .

:dataset a ja:RDFDataset ;
  ja:defaultGraph :empty_graph ;
  ja:namedGraph [ 
    ja:graphName graph:sample ; 
    ja:graph :sample_vocabulary  
  ] .

:sparql_service
  rdf:type joseki:Service ;
  rdfs:label "SPARQL-SDB";
  joseki:serviceRef "sparql/";
  joseki:dataset :dataset;
  joseki:processor [
    rdfs:label "SPARQL processor" ;
    rdf:type joseki:Processor ;
    module:implementation [  
      rdf:type joseki:ServiceImpl;
      module:className <java:org.joseki.processors.SPARQL>
    ] ;
    joseki:allowExplicitDataset "false"^^xsd:boolean ;
    joseki:allowWebLoading "false"^^xsd:boolean ;
    joseki:lockingPolicy  joseki:lockingPolicyMRSW
  ] .

Great! Now we need to actually start the server with the config file that we have provided:

joseki.sh
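#!/bin/bash
# Start Joseki on port 8081 using the config file in the current directory.
DD=$(pwd)
export JOSEKIROOT="$DD/Joseki-3.4.4"
# Run the server from inside the Joseki directory, then come back.
pushd "$JOSEKIROOT"
"$JOSEKIROOT/bin/rdfserver" --port 8081 "$DD/joseki_config.ttl"
popd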

Do please note that the joseki service needs to be running to make the urls work. I mention it in the spirit of “please check that your computer is plugged in”.

Starting the sparql server
pmurray@Paul3:~/SPARQL_DEMO$ ./joseki.sh

And the server starts perfectly fine. At this point, I should be able to navigate to http://localhost:8081/sparql/ (note the slash at the end).

It works fine – joseki correctly complains that I have not given it a query string. So let’s give it one!

http://localhost:8081/sparql/?output=text&query=select * where { graph ?g { ?s ?p ?o } }
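A browser will quietly percent-encode the spaces and braces in that URL for you. If you are calling the endpoint from a script instead, you have to do the encoding yourself. A small Python sketch, assuming the same local endpoint and the output and query parameters used above (the function name is just for illustration):

```python
from urllib.parse import urlencode

# The local Joseki endpoint from this post; note the trailing slash.
ENDPOINT = "http://localhost:8081/sparql/"


def sparql_url(query, output="text"):
    # urlencode percent-encodes the braces, question marks and asterisk,
    # and turns spaces into '+'.
    return ENDPOINT + "?" + urlencode({"output": output, "query": query})


print(sparql_url("select * where { graph ?g { ?s ?p ?o } }"))
```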

Now I want a better web interface than typing SPARQL into a command line, so I will use this from now on:

sparql.html
<html>
  <body>
      <form action="http://localhost:8081/sparql/" method="post" target="SPARQLOUTPUT">
	  <textarea style="background-color: #F0F0F0;" name="query" cols="70" rows="27">
select ?g ?s ?p ?o
where { 
  graph ?g { 
    ?s ?p ?o
  }
}
ORDER BY ?g ?s ?p ?o
          </textarea>
	  <br>
	  <input type="radio" name="output" value="xml"> xml, 
	  <input type="radio" name="output" value="json"> json,
	  <input type="radio" name="output" value="text"> text,
	  <input type="radio" name="output" value="csv"> csv,
	  <input type="radio" name="output" value="tsv" checked> tsv<br>
          Force <tt>text/plain</tt>: <input type="checkbox" name="force-accept" value="text/plain"><br>
	  <input type="submit" value="Get Results" >
      </form>
  </body>
</html>

And that does the job. Tick the “Force text/plain” checkbox to stop your browser from downloading the output as a file.



Federating data with JENA

July 29, 2012

I am going to attempt here to bring it all together and make some magic happen with SPARQL and RDF. My goal is to run a local and largely blank instance of JENA which fetches data from heterogeneous data sources, and applies reasoning rules over the top.

The goal is to demonstrate that RDF can be useful even without global, worldwide agreement on vocabulary and ontology. The key to making this work is not getting everyone to agree on terms and what they mean by terms, but to get everyone to clearly state what terms they use and what they mean by them. Hopefully, the subject matter itself means that the meanings are pretty much compatible.

Speaking of meanings: before I continue, I’d like to apologise in advance for my inevitable solecisms. I’m a computing person, not a biologist or taxonomist.

Step 1: Getting JENA going locally

Step 2: Linking the local JENA to more than one external SPARQL service

Step 3: Using OWL to translate the foreign data into a common local vocabulary

Step 4: running a query.


AGW Entails Nationalization

July 29, 2012

re: Global Warming’s Terrifying New Math.

Every cubic meter of coal, every liter of oil or gas, every mole of hydrocarbon taken out of the ground will get burned and added to the atmosphere. Sooner or later, by whatever chain of events, all of it will be burned for energy.

This means that if climate change is to be halted, no matter how it is done, the end result – the penultimate effect of the cause-and-effect chain – will be that fossil fuels are left in the ground. No matter how it is done, no matter by what framework or what means, no matter how indirectly, whether it is made illegal or simply uneconomic, the physical end result must be that digging up the coal largely ceases.

This simple observation means that all green initiatives by the fossil fuel industry are sham. If AGW is substantially checked, it will necessarily put them out of business.

The only way is to take these reserves out of profit-taking hands, by which I mean that they need (in the first instance) to be nationalized. The legal framework is simple, as a nation owns its mineral reserves and licenses out permission to mine them.

The next problem is getting the nation-states to stop mining the stuff, burning it and selling it to one another. But nationalization is a key first step.

I am pessimistic. This next generation will watch the world burn.


GBIF

July 23, 2012

Descending, again, into the maze of documentation that is – well – anything to do with bioinformatics. (As the joke goes: standards are great! So many to choose from!)

We are attempting to build conformant Darwin Core Archive files. The files we have do validate successfully.

But we’d like ‘em to be better. So. “Darwin Core Archive format, Reference Guide to the XML Descriptor File”. Page 7. “GBIF recommends a GBIF metadata profile¹”. Broken link.

Ok, I found this: eml.xsd (1.0.1) and it appears to be something approximating the correct sort of thing. It includes eml-gbif-profile.xsd. Outstanding!

The xml element must have a packageId, scope, and system. packageId is just a unique id – I’ll jam the timestamp in there, job done. Scope is fixed to “system”. And what is system? At a guess – it’s the namespace in which the package ids are unique, so in our case it’s meant to be “darwincore taxonomic trees from biodiversity.org.au”.

The profile.xsd does not have a namespace, but specifies that the elementFormDefault is “qualified”. Not sure what happens there. Will I need to explicitly define a namespace prefix for the empty namespace? Having trouble running xmllint owing to proxy nonsense. I need to set up a local XML catalog with xsd, xml, dc, dcterms and so on. So I am not 100% positive that my XML is correct against the schema, yet.

What else …

My, the schema sure does insist on a bunch of stuff. You must have creator, metadata provider, and contact blocks – all of which are “agent” blocks, although you can get away with each having only one subelement (organisation name).
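To make the shape concrete, here is a rough Python sketch that builds a bare-bones skeleton along those lines. Treat the details as assumptions: the namespace URI, the dataset wrapper and the exact element spellings (metadataProvider, organizationName) are from my reading of the schema rather than from a validated file, and the result leaves out other things the schema insists on, so it makes no claim to be schema-valid.

```python
import time
import xml.etree.ElementTree as ET

# Assumed namespace for the root element; check it against the schema you use.
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def minimal_eml(system, org_name):
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(f"{{{EML_NS}}}eml", {
        "packageId": str(int(time.time())),  # a timestamp as the unique id
        "scope": "system",                   # fixed value
        "system": system,                    # where the package ids are unique
    })
    # Children are deliberately left without a namespace here.
    dataset = ET.SubElement(root, "dataset")
    # The three required agent blocks, each with a single subelement.
    for block in ("creator", "metadataProvider", "contact"):
        agent = ET.SubElement(dataset, block)
        ET.SubElement(agent, "organizationName").text = org_name
    return ET.tostring(root, encoding="unicode")


# "Example Organisation" is a placeholder.
print(minimal_eml("darwincore taxonomic trees from biodiversity.org.au",
                  "Example Organisation"))
```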

The intellectual rights block is just a chunk of free text – no support for including mixed data. It would be nice to have dcterms:license or even creative commons elements in there, but the schema does not support it.

Coverage is nice, but the taxonomic coverage element is odd: an optional generalTaxonomicCoverage, then any number of taxonomicClassification elements. Each taxonomicClassification element is taxonRankName, taxonRankValue, commonName. So it seems that there’s no way to say that your data covers regnum PLANTAE unless you put it in as a common name. Perhaps taxonRankValue was meant to be the taxon name, and the wires got crossed somewhere?

Fun times, anyway.

