Embedding semantic web statements in PDF documents

March 7, 2017

PDF documents contain a “Document Metadata” section, which permits a subset of RDF to be embedded in the document. The subject of each triple must be ‘this PDF document’, written as an empty string. The value, however, may be a nested anonymous object.

Using the semantic web predicate owl:sameAs, the limitation that the RDF can only talk about the PDF document that it is embedded in can be circumvented at the semantic layer. While the RDF graph in a PDF document cannot directly talk about things that are not the document, OWL allows us to do so by implication.

Even without adding additional metadata to a PDF document, exposing the metadata to the semantic web via the document DOI might be a reasonable thing to do.


To include RDF predicates in a PDF document using Adobe Acrobat Pro, go to File > Properties and navigate through to “Additional Metadata”.

From here, predicates in an xml/rdf file can be added. Here’s a sample file:

  
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c316 44.253921, Sun Oct 01 2006 17:14:39">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cv="http://example.org/cv#"
    xmlns:amus="http://example.org/terms/"
    xmlns:amust="http://example.org/taxa/"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
    xmlns:owl ="http://www.w3.org/2002/07/owl#"
  > 
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <cv:sometext>This is some text</cv:sometext>
      <cv:someProperty rdf:resource="http://example.org/specimen/4321"/>
      <dwc:originalNameUsage rdf:parseType="Resource">
        <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Taxon"/>
        <cv:simple-literal-name>Taxonus simplex</cv:simple-literal-name>
        <owl:sameAs rdf:resource="urn:lsid:example.org:taxon:123456"/>
      </dwc:originalNameUsage>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

There are several constructs which we have found Acrobat Pro will refuse to import. Any rdf:about other than “” will not be imported. It also will not recognise rdf:parseType="Literal".

However, the main thing here is that you can include anonymous objects using rdf:parseType="Resource". This allows you to do the “owl:sameAs” trick. Above, an anonymous object has rdf:type Taxon and a cv:simple-literal-name of “Taxonus simplex”. By declaring this anonymous object to be owl:sameAs urn:lsid:example.org:taxon:123456, a reasoner can infer that that taxon has a simple literal name of “Taxonus simplex”.
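The inference can be sketched in a few lines of Python. This is only an illustration of what a reasoner would conclude, not a reasoner: it parses a cut-down copy of the sample above with the standard library, and the function name inferred_triples is made up for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

# A cut-down copy of the sample packet shown above.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cv="http://example.org/cv#"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
    <rdf:Description rdf:about="">
      <dwc:originalNameUsage rdf:parseType="Resource">
        <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Taxon"/>
        <cv:simple-literal-name>Taxonus simplex</cv:simple-literal-name>
        <owl:sameAs rdf:resource="urn:lsid:example.org:taxon:123456"/>
      </dwc:originalNameUsage>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def inferred_triples(xmp_text):
    """Return (subject, predicate, object) triples implied by owl:sameAs.

    Hypothetical helper: for every anonymous node that carries an
    owl:sameAs, restate the node's other properties about the sameAs target.
    """
    root = ET.fromstring(xmp_text)
    triples = []
    for node in root.iter():
        if node.get(f"{{{RDF}}}parseType") != "Resource":
            continue
        same_as = node.find(f"{{{OWL}}}sameAs")
        if same_as is None:
            continue
        subject = same_as.get(f"{{{RDF}}}resource")
        for prop in node:
            if prop is same_as:
                continue
            # ElementTree tags look like "{namespace}local"; join them into a URI.
            predicate = prop.tag[1:].replace("}", "")
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


for triple in inferred_triples(XMP):
    print(triple)
```

The triples printed have the LSID as their subject, even though the embedded RDF itself only ever talks about the document.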


From here comes the question of “great, what do you do with it”?

In keeping with the semantic web structure, perhaps it might be appropriate that the RDF embedded in the PDF be returned when an HTTP fetch is made against the DOI with an Accept header of “application/rdf+xml”. This could be done either with a servlet that can parse pdf files, or it might be done by exporting the PDF to XML and using a stylesheet transformation to extract any rdf:RDF in the document.
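A rough sketch of the two pieces such a service would need, in Python. It assumes the XMP packet is stored in the PDF as plain, uncompressed, unencrypted UTF-8 text, which is common but not guaranteed, so a real service would want a proper PDF parser as a fallback. The function names extract_xmp and wants_rdf are invented for the example.

```python
import re

# Assumption: the metadata stream is stored uncompressed, so the packet can
# be found by scanning the raw bytes of the file.
XMP_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_xmp(pdf_bytes):
    """Return the first XMP packet in the file as text, or None if absent."""
    match = XMP_PACKET.search(pdf_bytes)
    return match.group(0).decode("utf-8") if match else None


def wants_rdf(accept_header):
    """True if the Accept header lists application/rdf+xml (q-values ignored)."""
    media_types = [
        part.split(";")[0].strip().lower() for part in accept_header.split(",")
    ]
    return "application/rdf+xml" in media_types
```

A handler behind the DOI would call wants_rdf on the request's Accept header and, if it returns True, respond with the output of extract_xmp instead of the PDF itself.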

Doing this means that the document metadata becomes part of the semantic web without much additional work. Even without adding additional metadata, the PDF metadata contains things such as creation time, document title and so on, all of which may be of interest to consumers of semantic data.


A fine distinction, but an important one.

January 8, 2015

Wrestling with vocabulary. Again.

I am approaching the job bottom-up: we have a set of tables, and we want to publish them. Rather than proposing a vocabulary and attempting a procrustean solution, I am simply exposing the data using d2rq and using the vocabulary to document what is exposed.

More or less.

Of course, nothing is ever that simple.

We have a number of data items in our system, eg: NAME.
We also have a number of tables that hold enumerations: NAME_GROUP, NAME_TYPE, NAME_STATUS and so on.

Now, every NAME has a NAME_GROUP. Name group is “botanical” or “zoological”. There is some argument about this, as these things are called “Nomenclatural Codes”. However, “CODE” tends to mean something else inside databases.

But NAME_RANK also has a name group (the collapsing of RDF identifiers means that this is many-to-many). As does NAME_STATUS.

My problem is – what class and predicate names do I use? I could use hasGroup for everything, but it seemed to me to be wrong that the question “what is in this group?” would pull out both names and vocabulary items – ranks and statuses.

I could have a separate predicate name for each place where group appears, but this seems horribly over engineered. I mean, there’s more than one kind of group, so we are looking at nameNameGroup, nameTypeNameGroup, nameStatusNameGroup and so on. Horrible.

Thinking about this, I decided to take this approach. A name group is primarily a group of names. The sense in which ‘nom. cons.’ belongs to group ‘botanical’ is different to the sense in which ‘Doodia’ belongs to it. ‘nom. cons.’ isn’t in the group. It’s simply that we are declaring that it is meaningful to apply that term to names that are.

And so I am using one predicate for “this name is in this group”: nsl_name:group, and a different one for “this vocabulary item can be used for names in this group”: nsl_name:nameGroup.
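To make the difference concrete, here is a toy illustration in Python, with triples held as plain tuples. The subject and object identifiers (name:Doodia, status:nom_cons, group:botanical) are invented for the example; only the two predicate names come from the text above.

```python
# Two predicates, two different senses of "belongs to the botanical group".
triples = [
    ("name:Doodia", "nsl_name:group", "group:botanical"),
    ("status:nom_cons", "nsl_name:nameGroup", "group:botanical"),
]


def subjects(triples, predicate, obj):
    """Return every subject that has the given predicate and object."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# "What is in this group?" pulls out names only.
print(subjects(triples, "nsl_name:group", "group:botanical"))

# "Which vocabulary items apply to names in this group?" is a separate question.
print(subjects(triples, "nsl_name:nameGroup", "group:botanical"))
```

With a single hasGroup predicate, the first query would have returned the status as well as the name.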

Not 100% sure about what these should be called, of course. I dislike putting ‘has’ on the front of all the predicates – it’s just noise. And maybe nameGroup above should be called something like applicableTo. But then it’s,

“applicable to what?”
“ok then: applicableToGroup”
“groups of what?”
“well all right: applicableToNameGroup”

which is arguably correct but 21 letters long. Any vocabulary term longer than ‘internationalization’ is just not acceptable, if you ask me.

Ok, so let’s just go with ‘group’. The problem now is that we now have two uris, nsl_name:group for the predicate and nsl_name:Group for the class that are identical except for capitalisation.

As I understand it: in RDF, case is significant. But to an http server, it is not supposed to be (although there are plenty of servers that ignore this). But the fact that HTTP servers are supposed to ignore case actually doesn’t matter in this case, because the part of those URIs that differs is the URI fragment.

That is, this:

http://example.org/voc/name/group
http://example.org/voc/name/Group

might be a problem in linked data. It might be the case that the http server only ever returns one of the two files and one of those two vocabulary terms becomes inaccessible. But this:

http://example.org/voc/name#group
http://example.org/voc/name#Group

isn’t a problem, because the http server is serving up the one document http://example.org/voc/name.rdf.
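This is easy to check with Python's standard library, since the fragment is stripped by the client and never sent to the server. The helper name document_url is made up for the example.

```python
from urllib.parse import urldefrag


def document_url(term_uri):
    """Return the part of the URI that is actually requested from the server."""
    return urldefrag(term_uri).url


path_terms = [
    "http://example.org/voc/name/group",
    "http://example.org/voc/name/Group",
]
fragment_terms = [
    "http://example.org/voc/name#group",
    "http://example.org/voc/name#Group",
]

# Path-style terms are two separate requests, differing only in case.
print({document_url(u) for u in path_terms})

# Fragment-style terms collapse to a single request for one document.
print({document_url(u) for u in fragment_terms})
```

Whether that one request ends up at name.rdf is then a matter of server configuration, but it is the same request for both terms.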

Meh – maybe I should put the ‘has’ back on the predicate names.

WHERE { ?n nsl_name:group ?g }

vs

WHERE { ?n nsl_name:hasGroup ?g }

Maybe I should use inGroup rather than hasGroup.
Maybe GROUP should be a class.
Maybe I should generate a named individual botanical, a class Group.botanical, and declare that Group.botanical is owl:equivalentTo ( owl:hasValue group Group.botanical ).
Maybe I should swap that – doesn’t “botanical” make sense as a class name on its own?

But I’m not sure d2rq can generate something that complex.
Should they go in the static vocabulary files?
Who keeps them up-to-date?
What about NAME_TYPE, which has 25 values? Do I still want a static file for that?

Choices, choices. It’s all very much still in flux.


If this doesn’t work this time, I am going home anyway

October 8, 2014

The thing about having this curated tree thingumajig is that it’s general enough to handle, well, whenever you have a hierarchically organised document that changes over time.

I have managed to get the APC (Australian Plant Census) classification into it (although this is failing in TEST because the psql sequence for ids is capped at 10 million). But the Australian Plant Census is not the only classification around. There is the vexing problem of “what family is this name in?” for – well – everything, really.

You see, names are built from other names. A species name has a genus. But something as basic as family is not really part of the name, as such. We need a classification for where we have put the names as far as our internal systems are concerned: all names – excluded, miscellaneous, what have you. Prior to this, that data was simply part of the name table. As I understand it, the conclusion the names people have come to, as they work through what it all means, is that this is simply wrong.

So we need another classification alongside APC in my tree thing.

And this is absolutely not a problem at all. It’s built to do this. It will take the APC, and AFD trees, as well as the APNI classification and all of our other APNI-like classifications (AMANI and so on). Oh, and I have saved a slot for the herbarium classification – ‘CANB’.

But right now, I am attempting to load 188340 names from APNI into my long-suffering postgres instance running on my macbook. In a single transaction. It’s churning its little heart out, but happily it has a solid-state drive so I don’t have to listen to it weep.

It looks frozen. Wait a moment! I just got another line of output! Maybe I need to break this up somehow. Or I could just cheat (because I know that all of the nodes that are currently in state foo need to have bar done to them in this particular import).

9PM, I am going home.


It has been some time

October 2, 2014

It has been a while since I blogged.

The National Species List team is now proceeding with development, with a focus on replacing the existing APNI Oracle Forms app. Data-wise, there are two important components.

The ‘names’ component is the main focus of APNI, and the one that grapples with the whole question of how taxonomic names are used in scientific publications. The basic idea is first, that whenever a name is used, it’s always used somewhere, and the author who uses that name is almost always citing it from somewhere else – this is simply how the whole process of scholarship works. Second, when a document contains more than one name, they are usually in some sort of relationship to each other at the place they appear.

Thus name ∩ reference ≡ instance. An instance is (almost always) a citation of some other instance, and it has other instances that it cites in the same place (obviously, this is done as a one to many cited_by value on the target). Of course, these relationships have many, many, many different types and sorting through what they are is a challenge for the scientists. Thus a paper may talk about a name, state that another name used elsewhere is a misspelling, and that a further name used somewhere else is a synonym.

The thing to note is that this component mainly stores facts about published names – Dr So-and-so did indeed publish a paper where he used certain names and said certain things about them.

Most of the work done so far has been on this component and on a rather nice web app to replace APNI.

The other component is the tree component – my bit. My main focus over the past week or two has been getting the APC classification into it. This is nowhere near as straightforward as you might suppose.

The APC data is in a table, the pertinent bits being “this id was an apc concept from this date to that date, and it was placed under this other id”.

I have dealt with the issue where the other id does not have a date range that matches the ranges of its sub-ids. This happens a lot, owing to the history of APC. Botanists, it seems, really don’t care much about higher classification above family. So Genus A got declared as being put under family F, then at a later date family F was added to APC (with no declared supertaxon), then at a later date still, once the higher classification was assembled (phylum Ph, class C, order O), F was put into order O where it belongs.

In the data, it looks roughly like this:

Taxon  Supertaxon  From  To
G      F           1998  ongoing
F      (none)      1999  2002
C      Ph          2000  ongoing
O      C           2000  ongoing
F      O           2003  ongoing

My system deals with this by creating “declared supertaxon” nodes for ids that are declared as being supertaxa over time ranges where those supertaxa are not APC concepts. That way, if five genera were put in a family that wasn’t in APC until later, that information is captured. You can see the APC tree grow over time – I should post some screenshots.
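Working out where those nodes are needed amounts to an interval subtraction. The following Python sketch is my reading of the rule, not the actual import code: it uses the rough data above, treats the From and To years as inclusive (an assumption), and the function name declared_supertaxon_gaps is invented for the example.

```python
import math

# (taxon, supertaxon, first_year, last_year)
# None means "no supertaxon declared" or "ongoing" respectively.
RECORDS = [
    ("G", "F", 1998, None),
    ("F", None, 1999, 2002),
    ("C", "Ph", 2000, None),
    ("O", "C", 2000, None),
    ("F", "O", 2003, None),
]


def _span(first, last):
    """Convert inclusive years into a half-open (start, end) interval."""
    return (first, math.inf if last is None else last + 1)


def declared_supertaxon_gaps(records):
    """Find periods where a declared supertaxon is not itself a concept.

    Returns (taxon, supertaxon, start, end) tuples with half-open year
    ranges; each one is a period that needs a "declared supertaxon" node.
    """
    concept_spans = {}
    for taxon, _, first, last in records:
        concept_spans.setdefault(taxon, []).append(_span(first, last))

    gaps = []
    for taxon, supertaxon, first, last in records:
        if supertaxon is None:
            continue
        start, end = _span(first, last)
        cursor = start
        # Walk the supertaxon's own spans in order, recording uncovered time.
        for s, e in sorted(concept_spans.get(supertaxon, [])):
            if e <= cursor or s >= end:
                continue
            if s > cursor:
                gaps.append((taxon, supertaxon, cursor, s))
            cursor = max(cursor, e)
        if cursor < end:
            gaps.append((taxon, supertaxon, cursor, end))
    return gaps


for gap in declared_supertaxon_gaps(RECORDS):
    print(gap)
```

On this data it reports that G needs a declared-supertaxon node for F during 1998, before F became a concept. It also reports Ph as open-ended, but only because Ph never appears as a taxon in this abbreviated table.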

This isn’t my current problem.

My current problem is this:

Many APC records declare ids that are not APC instances, but the names on those instances are the same as some other instance that is in APC. This happens whenever someone went “we are not using Dr A’s publication for genus G anymore, we are using Dr B’s.” Of course, all the species (S*) in genus G still declare that Dr A’s version of G is their supertaxon.

So the final step of my import is “After loading everything, make a change using right now as the date mark. Anything that has a declared supertaxon that isn’t in APC, where there is a supertaxon that is in APC, should be moved there.”

And this blew up, blew up, blew up. Eventually I found the problem. You see, four or five APC records declare that their supertaxon is the same as the taxon. I ignore that, no probs. But there are also some that declare a supertaxon with the same name as themselves. Which hadn’t been a problem until now. But this new final step attempts to move those taxa to be under themselves.

The nice thing is that my tree manipulation layer catches this nonsense and throws an exception. That is – we don’t wind up with screwed-up data in the final tables. The versioning algorithm says “WTF?” and correctly throws an exception. Which, as a certain colleague of mine would say, is quite pleasing.
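The check itself is simple in principle. Here is a minimal Python sketch of the idea, using a bare child-to-parent dictionary rather than the real versioned node structure; the names TreeError and move are invented for the example.

```python
class TreeError(Exception):
    """Raised when an operation would corrupt the tree."""


def move(parent_of, node, new_parent):
    """Re-parent node, refusing any move that would put it under itself.

    parent_of maps each node to its parent (None for the root). Walking up
    from the proposed parent must never arrive back at the node being moved.
    """
    ancestor = new_parent
    while ancestor is not None:
        if ancestor == node:
            raise TreeError(f"cannot move {node!r} under {new_parent!r}")
        ancestor = parent_of.get(ancestor)
    parent_of[node] = new_parent


tree = {"root": None, "G": "root", "S1": "G", "S2": "G"}
move(tree, "S1", "root")  # fine
try:
    move(tree, "G", "G")  # a taxon declared as its own supertaxon
except TreeError as err:
    print("rejected:", err)
```

Because the exception is raised before anything is written, the dictionary is left exactly as it was, which is the property that matters for the final tables.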

So, I wrote a fix. It ran against a test. It didn’t run against the full load. Is this because my fix doesn’t work, or because there is some other even more devious and subtle problem in the data? (UPDATE: more devious and subtle, it seems.)

Here’s an example:

Blue arrows = ‘this node was copied to that node’, a version branch.
Red arrows = ‘this node was replaced by that node’, a version merge.
Black/grey lines = subtaxa
black = current, grey = no longer current
rectangle = taxon
hexagon = unnamed classification root
oval = none of the above. There is one of these: a tag I attach to the APC import itself, prior to the final mixup operations. This tag doesn’t have a name, but there is a comment on it.
flag = a named tag
no arrowhead = a versioning link
dot arrowhead = a tracking link
anchor arrowhead = a fixed link

APC concept 39864 is for apni instance 242777 with name 309388. It declares its supertaxon to be apni instance 242775 which also has name 309388. My system responds by moving all subtaxa of 242775 to 242777 except for 242777 itself. This is moved up to the root of the APC classification. This leaves 242775 dangling there by itself – but that’s not a problem big enough to worry about.

This diagram also shows a final tree operation in which no changes were made (purely because this is only a partial data set). You can see that operation in the move from 4965798:4965842 to 4965798:4965856.

In any case. I printed out the 8 problematic records and gave them to one of the scientists here. She was like “What the hell? Oh I see – the Genus isn’t in APC.” So in the long term, these problems simply will be fixed.


Tree-ing

March 17, 2014

So I attempted to explain what I had in mind at work today, and managed to determine that my ideas are pretty half-baked. The more I think about it, the more it’s clear that the new model needs to do everything that the tree model does.

The tree idea is pretty mature at this stage, and does almost everything. What it doesn’t do is the main take-away of the list idea: for the current version of a tree, every item is numbered in sequence. You can get all the Pinaceae by finding the taxon, asking it what row range it covers, and then selecting and ordering by row number over that range.

Can this notion be back-fitted onto the tree idea? Yes it can.

First, the subnodes of a node need to have a definite order.

Next, the notion of “a checklist” corresponds to “a tree root”. This is to say, I need to distinguish between the whole history of a tree, and a particular point in that history. I’m thinking of calling them ‘tree’ and ‘checklist’. ‘Tree’ becomes a partition of the data set, to which things like permissions and user groups might be attached.

Each node, in addition to belonging to a ‘tree’, belongs to some particular checklist and has a sequence within that checklist. This needs to get updated when a versioning is performed. Basically, these tree roots are much more explicitly managed by the system.

Old checklists still need to be tree-walked, as we are doing now. There’s no easy way to get item 50 in checklist 9 once checklist 9 has been replaced. But although there’s no easy way of doing it, it can still be done with a recursive query which can zip down the tree to exactly the right spot. The trick is to note the offset between where checklist 9 says the node is and where the current checklist (checklist 11) says it is, and to carry that offset down the tree walk. To tell the truth, I suspect it will work rather well, and it will work far better than the current model which just has the nodes floating around in space.

Yep – retain the existing versioning model, which I am confident about, and add a notion of ‘absolute position in the current checklist’. Keeping that correct will be an interesting and important addition to the existing update queries, but it probably won’t need a load of completely new stuff. One extra table to explicitly hold the checklist roots, and some new fields to hold numbers.

(note to self – store the depth as well. Makes it easy to produce an indented list with just the select by row range.)
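The row numbering, with the depth stored alongside, can be sketched in Python as a plain preorder walk. This is an illustration of the idea rather than the update queries themselves; the tiny classification and the function names are made up for the example, and "size" here counts the node itself (an assumption).

```python
def number_rows(tree, root):
    """Number the nodes of a tree in preorder.

    tree maps a node name to its ordered list of subnodes. Each row records
    its row number, its depth, and the size of its subtree (itself included).
    """
    rows = []

    def visit(name, depth):
        entry = {"row": len(rows), "name": name, "depth": depth, "size": 1}
        rows.append(entry)
        for child in tree.get(name, []):
            entry["size"] += visit(child, depth + 1)
        return entry["size"]

    visit(root, 0)
    return rows


def subtree(rows, name):
    """Select a taxon and everything under it by row range alone."""
    start = next(r for r in rows if r["name"] == name)
    return rows[start["row"] : start["row"] + start["size"]]


def indented(rows):
    """Render rows as an indented list using only the stored depth."""
    base = rows[0]["depth"]
    return ["  " * (r["depth"] - base) + r["name"] for r in rows]


TREE = {
    "Plantae": ["Pinaceae", "Araucariaceae"],
    "Pinaceae": ["Pinus", "Abies"],
    "Pinus": ["Pinus radiata"],
    "Araucariaceae": ["Araucaria"],
}

rows = number_rows(TREE, "Plantae")
print("\n".join(indented(subtree(rows, "Pinaceae"))))
```

No tree walk is needed to read the subtree back out: it is one lookup for the taxon and one slice over the row range.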


Ok.

March 15, 2014

Alright.

We have a table of the checklists. A “checklist” is a version at a particular time, so there are multiple records for ‘APC’ but only one of these will be the current APC checklist, with a null replaced_by.

id                                Checklist id
copy_of_id, replaced_by_id        Checklist history pointers
persisted_at_ts, replaced_at_ts   Checklist history timestamps
(other data fields)               Labels, owner, long title, uri, etc.

We have two tables for the items in the checklists: checklist_p and checklist_v, p=’physical’ and v=’virtual’.

Checklist_p always contains the current version of every checklist. In particular, checklist_p contains the taxon ids. The data structure is

id                  The physical id
cl_id               The checklist id
row                 The row number relative to the checklist (starting at row 0).
parent_row,         Tree structure in doubly-linked grapevine format. I’m thinking about using offsets rather than numbers for reasons which shall be explained. (I might make these names shorter)
size,
prev_sibling_row,
next_sibling_row,
first_child_row,
last_child_row
(other fields)      Taxon id and other data items

Note that this makes it straightforward to read out current lists in row order, and to check that current lists do not have duplicate taxa. If we slap a unique constraint on cl_id/taxon_uri, then job done.
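A quick way to see the constraint doing its job, using Python's built-in sqlite3 as a stand-in for the real database. Only a few of the columns are included, "row" is renamed row_no to stay clear of SQL keywords, and the taxon column is an integer taxon_id; all of these are simplifications for the example, as are the function names.

```python
import sqlite3


def make_checklist_p(conn):
    """Create a cut-down checklist_p with the uniqueness rules described."""
    conn.execute(
        """
        CREATE TABLE checklist_p (
            id       INTEGER PRIMARY KEY,
            cl_id    INTEGER NOT NULL,
            row_no   INTEGER NOT NULL,
            taxon_id INTEGER,              -- NULL for the row-zero placeholder
            UNIQUE (cl_id, row_no),
            UNIQUE (cl_id, taxon_id)       -- no duplicate taxa within a checklist
        )
        """
    )


def try_insert(conn, cl_id, row_no, taxon_id):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO checklist_p (cl_id, row_no, taxon_id) VALUES (?, ?, ?)",
                (cl_id, row_no, taxon_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
make_checklist_p(conn)
print(try_insert(conn, 1, 0, None))  # root placeholder row
print(try_insert(conn, 1, 1, 42))    # taxon 42 in checklist 1
print(try_insert(conn, 1, 2, 42))    # same taxon again in checklist 1
print(try_insert(conn, 2, 1, 42))    # same taxon in a different checklist
```

The third insert is the only one rejected. The same taxon may appear in as many checklists as you like, just not twice in one.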

Row number zero has a null taxon_id; it’s used so that a checklist may contain multiple items at the top level.

Checklist_v contains the arrangement of all checklists. It is similar in structure to checklist_p – a grapevine list using row numbers. Each item in the list refers to another list item either in checklist_v or checklist_p. Each item in the list either uses the arrangement of sub nodes from the thing referred to, or has its arrangement in checklist_v.

This means that checklist_v, while being structured like checklist_p, has entries missing – they are located elsewhere. You can see why I am thinking that storing the tree structure as offsets might be the go. It means that when you chase a reference, you don’t have to keep track of an adjustment to the row numbers referred to.
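A small Python illustration of why offsets help. It shows only the parent link, and the representation (a list of parent offsets per fragment) and the function name absolute_parents are invented for the example.

```python
def absolute_parents(base_row, parent_offsets):
    """Resolve relative parent links for a fragment placed at base_row.

    parent_offsets[i] is (parent's row - node's own row) for the i-th node
    of the fragment, or None for the fragment's root.
    """
    return [
        None if offset is None else base_row + i + offset
        for i, offset in enumerate(parent_offsets)
    ]


# A four-node fragment: node 1 and node 3 sit under node 0, node 2 under node 1.
FRAGMENT = [None, -1, -1, -3]

# The same stored offsets work wherever the fragment is referenced from.
print(absolute_parents(10, FRAGMENT))
print(absolute_parents(100, FRAGMENT))
```

With absolute row numbers, the stored links would be right for only one of those two placements, and every reference chase would have to carry a correction.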

My main concern is whether or not checklist_v should be permitted to refer to itself. The problem with this is that it means that reconstructing old checklists becomes a tree-walk of arbitrary depth. The problem with *not* doing it is that if a node is rearranged so that it no longer appears in checklist_p, then the node as it was needs to be reconstructed in checklist_v everywhere that it appears. In effect, checklist_v fills up with crud that we don’t really care about anymore but we have to keep.

But it’s not as bad as it appears – nodes get moved (more on this in a minute), so checklist_v self-references don’t follow a history, they point to the latest place where the node appears in that form. The very cool thing about this is that we can show that on a printout of an old, reconstructed list.

To continue.

The basic job of editing is this:

You construct a new checklist made up of bits of other checklists and new data which you enter. You can opt to push an entry in your list into another (current) list – replacing a node in that list with a node in yours. Note that inserts and deletes are done by replacing the parent node of the node you are trying to insert.

When you do this, a new checklist is created. The nodes from your list and the unaltered nodes from the target list are moved into the new list and replaced with references.

And, well, and that’s it. (TODO: what happens to the NSL list? it has to be left in permanent “I am not a current list” state, or something.)

No, actually, that’s not it. The other issue is where two people are hammering away at the same list. A replacement of a node at high level means that the checkout needs to be turned into a checkout of the new node. Unaltered sibling nodes need to be fixed up …. I still need the notion of a ‘tracking’ link. Which is kind of comforting. This system is probably isomorphic to my git-like nodes system, just redone with row numbers to make working with the current list much easier.

A worked example

Actually – no. I’ll start coding, then once that’s done do an update and then dump the results. Which is how those graphs were done.


March 13, 2014

A checkin is ‘copy this list fragment from that list over the top of this fragment’.

If a bit is unchanged as a result of the checkin (ie, everything else), then conceptually it gets moved to the new list and replaced with a reference. Cool. Now, if the physical id is unchanged, this means that physical ids will always point to the latest occurrence of the item. The item needs to be updated to say “I belong to this list, now”. A node should be marked, then, with the place where it was first made. This way, we can say that a list contains Polychaeta as per AFD timestamp-to-timestamp.