Descending, again, into the maze of documentation that is – well – anything to do with bioinformatics. (As the joke goes: standards are great! So many to choose from!)

We are attempting to build conformant Darwin Core Archive files. The files we have do validate successfully.

But we’d like ’em to be better. So. “Darwin Core Archive Format, Reference Guide to the XML Descriptor File”. Page 7. “GBIF recommends a GBIF metadata profile”. Broken link.

Ok, I found this: eml.xsd (1.0.1) and it appears to be something approximating the correct sort of thing. It includes eml-gbif-profile.xsd. Outstanding!

The root eml element must have a packageId, scope, and system. packageId is just a unique id – I’ll jam the timestamp in there, job done. Scope is fixed to “system”. And what is system? At a guess – it’s the namespace in which the package ids are unique, so in our case it’s meant to be “darwincore taxonomic trees from”.
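A minimal sketch of those three attributes, using Python's standard library. The namespace URI, the `make_eml_root` helper and the example.org value for system are all assumptions made for illustration; check the URI against the targetNamespace in your own copy of eml.xsd.

```python
import time
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace URI of the root element. Verify against eml.xsd.
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def make_eml_root(system, now=None):
    """Build a bare root element carrying packageId, system and scope.

    `system` is whatever string names the space in which package ids are
    unique. `now` lets a caller pin the timestamp (handy for testing).
    """
    ET.register_namespace("eml", EML_NS)
    stamp = int(time.time() if now is None else now)
    root = ET.Element(f"{{{EML_NS}}}eml")
    root.set("packageId", str(stamp))  # the timestamp, jammed in
    root.set("system", system)
    root.set("scope", "system")        # fixed value
    return root


# Hypothetical system identifier, for illustration only.
root = make_eml_root("http://example.org/taxonomic-trees", now=1700000000)
print(ET.tostring(root, encoding="unicode"))
```

A timestamp with one-second resolution is only unique if you never generate two packages in the same second; a UUID would be the safer choice if that matters.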

The profile.xsd does not have a namespace, but specifies that the elementFormDefault is “qualified”. Not sure what happens there. Will I need to explicitly define a namespace prefix for the empty namespace? Having trouble running xmllint owing to proxy nonsense. I need to set up a local XML catalog with xsd, xml, dc, dcterms and so on. So I am not 100% positive that my XML is correct against the schema, yet.
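While the validator is out of action, a parser can at least report which namespace each element lands in. The sketch below assumes (unconfirmed against the schema) that the root element is namespaced and the profile's child elements sit in no namespace at all; the document text, the namespace URI and the `split_tag` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a hand-written sample, not a schema-validated document.
DOC = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="1700000000"
         system="http://example.org/taxonomic-trees"
         scope="system">
  <dataset>
    <title>Example taxonomic tree</title>
  </dataset>
</eml:eml>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag  # element is in no namespace


root = ET.fromstring(DOC)
tags = [split_tag(el.tag) for el in root.iter()]
for namespace, local in tags:
    print(local, "->", namespace)
```

Because the sample declares no default namespace (no bare xmlns="..."), the unprefixed children are in no namespace, and nothing extra has to be declared for them. Whether that is what the schema actually wants is still a question for the validator.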

What else …

My, the schema sure does insist on a bunch of stuff. You must have creator, metadata provider, and contact blocks – all of which are “agent” blocks, although you can get away with each having only one subelement (organisation name).
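The three agent blocks at their barest might be sketched like this. The `add_agent` helper and the organisation name are hypothetical, and this is not a complete dataset: the schema requires other elements too (a title, for one) and is fussy about their order.

```python
import xml.etree.ElementTree as ET


def add_agent(parent, role, organization_name):
    """Append a minimal agent block: one element with one organizationName."""
    agent = ET.SubElement(parent, role)
    ET.SubElement(agent, "organizationName").text = organization_name
    return agent


dataset = ET.Element("dataset")
# All three roles share the same agent structure.
for role in ("creator", "metadataProvider", "contact"):
    add_agent(dataset, role, "Example Herbarium")  # hypothetical name

print(ET.tostring(dataset, encoding="unicode"))
```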

The intellectual rights block is just a chunk of free text – no support for mixed content. It would be nice to have dcterms:license or even creative commons elements in there, but the schema does not support it.

Coverage is nice, but the taxonomic coverage element is odd: an optional generalTaxonomicCoverage, then any number of taxonomicClassification elements. Each taxonomicClassification element is taxonRankName, taxonRankValue, commonName. So it seems that there’s no way to say that your data covers regnum PLANTAE unless you put it in as a common name. Perhaps taxonRankValue was meant to be the taxon name, and the wires got crossed somewhere?
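If that guess is right and taxonRankValue is meant to carry the taxon name, the block for PLANTAE would come out roughly as follows. This is a sketch under that assumption only; the `taxonomic_coverage` helper is hypothetical and the result has not been validated against the schema.

```python
import xml.etree.ElementTree as ET


def taxonomic_coverage(general, classifications):
    """Build a taxonomicCoverage element.

    `general` is optional free text. `classifications` is a list of
    (rank_name, rank_value, common_name) tuples; None entries are skipped.
    """
    coverage = ET.Element("taxonomicCoverage")
    if general is not None:
        ET.SubElement(coverage, "generalTaxonomicCoverage").text = general
    for rank_name, rank_value, common_name in classifications:
        item = ET.SubElement(coverage, "taxonomicClassification")
        for tag, text in (
            ("taxonRankName", rank_name),
            ("taxonRankValue", rank_value),  # ASSUMPTION: holds the taxon name
            ("commonName", common_name),
        ):
            if text is not None:
                ET.SubElement(item, tag).text = text
    return coverage


coverage = taxonomic_coverage(None, [("regnum", "Plantae", "plants")])
print(ET.tostring(coverage, encoding="unicode"))
```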

Fun times, anyway.

