Embedding semantic web statements in PDF documents

March 7, 2017

PDF documents contain a “Document Metadata” section, permitting a subset of RDF to be embedded in a PDF document. However, the subject of each triple must be ‘this PDF document’ – an empty string. The object of a triple, though, may be a nested anonymous resource.

Using the OWL predicate owl:sameAs, the limitation that the embedded RDF can only talk about the PDF document itself can be circumvented at the semantic layer. While the RDF graph in a PDF document cannot directly describe things other than the document, OWL allows us to do so by implication.

Even without adding additional metadata to a PDF document, exposing the metadata to the semantic web via the document DOI might be a reasonable thing to do.

To include RDF predicates in a PDF document using Adobe Acrobat Pro, go to File > Properties and navigate through to “Additional Metadata”.

From here, predicates in an RDF/XML file can be added. Here’s a sample file:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c316 44.253921, Sun Oct 01 2006 17:14:39">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:cv="http://example.org/cv#"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
    <rdf:Description rdf:about="">
      <cv:sometext>This is some text</cv:sometext>
      <cv:someProperty rdf:resource="http://example.org/specimen/4321"/>
      <dwc:originalNameUsage rdf:parseType="Resource">
        <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Taxon"/>
        <cv:simple-literal-name>Taxonus simplex</cv:simple-literal-name>
        <owl:sameAs rdf:resource="urn:lsid:example.org:taxon:123456"/>
      </dwc:originalNameUsage>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

There are several constructs which we have found Acrobat Pro will refuse to import. Any rdf:about other than “” will not be imported. It also will not recognise rdf:parseType="Literal".

However, the main thing here is that you can include anonymous objects using rdf:parseType="Resource". This allows you to do the “owl:sameAs” trick. Above, an anonymous object has rdf:type Taxon and a cv:simple-literal-name of “Taxonus simplex”. By declaring this anonymous object to be owl:sameAs urn:lsid:example.org:taxon:123456, a reasoner can infer that that taxon has a simple literal name of “Taxonus simplex”.

Which raises the question: great, but what do you do with it?

In keeping with the semantic web structure, perhaps it might be appropriate that the RDF embedded in the PDF be returned when an HTTP fetch is made against the DOI with an Accept header of “application/rdf+xml”. This could be done either with a servlet that can parse PDF files, or by exporting the PDF to XML and using a stylesheet transformation to extract any rdf:RDF in the document.
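Roughly, the dispatch would look like this. To be clear, this is a sketch: the landing URL is made up, and whatever actually extracts the rdf:RDF from the PDF (the servlet or the stylesheet transformation) is left out.

```javascript
// Sketch of the content negotiation for a DOI fetch. Only the media type
// and the 303 convention are real; the landing URL is a placeholder, and
// extracting the rdf:RDF from the PDF is left to the servlet or stylesheet
// transformation described above.
function chooseRepresentation(acceptHeader) {
  const accept = (acceptHeader || '').toLowerCase();
  if (accept.includes('application/rdf+xml')) {
    // Semantic web client: serve the RDF extracted from the PDF metadata.
    return { status: 200, contentType: 'application/rdf+xml' };
  }
  // Everyone else: the usual 303 redirect to a human-readable page.
  return { status: 303, location: 'https://example.org/landing-page' };
}
```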

Doing this means that the document metadata becomes part of the semantic web without much additional work. Even without adding additional metadata, the PDF metadata contains things such as creation time, document title and so on, all of which may be of interest to consumers of semantic data.


January 29, 2016


Our goal is to provide services by which other people may build webapps. The idea is, you GET something or POST something, and get back some JSON. This then goes to any number of client-side frameworks that are happy to work with JSON.

No probs, so far.

Oh, and of course I need to authenticate.

Ha ha ha! You poor fool! Welcome to the wonderful world of CORS. There’s rather a complicated dance you have to do, primarily because the world is full of bastards trying to steal your credit card details. It’s complicated for the same reason that banks have bullet-proof glass and safes all over the shop.

Now, I’m using Angular, and they seem to like this thing called a JSON Web Token – JWT. You log in to your server, you get back a JWT, and you include that in an HTTP header. Of course, this has to be over HTTPS or it’s game over.


JWTs have a structure, but at the moment on my dev box the JWT is just the user name – no password, no nothing. Not sure how it’s supposed to be done, but on my dev box (I stress this is nowhere near prod), I’ll pass it back and forth in an HTTP header named nsl-jwt.
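For what it’s worth, the structure a real JWT has is three base64url-encoded segments joined by dots: header.payload.signature. A sketch of the decoding side (no signature verification, obviously, and the username-as-token hack above ignores all of this for now):

```javascript
// Sketch of what a real JWT looks like, as opposed to the bare user name
// being passed around on the dev box: three base64url segments,
// header.payload.signature. This decodes the two JSON segments; it does
// NOT verify the signature.
function decodeJwt(token) {
  const fromB64url = (s) =>
    Buffer.from(s.replace(/-/g, '+').replace(/_/g, '/'), 'base64').toString('utf8');
  const [header, payload] = token.split('.');
  return {
    header: JSON.parse(fromB64url(header)),
    payload: JSON.parse(fromB64url(payload)),
  };
}
```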

Back end

Our Grails app is using Shiro. I built a credentials object named JsonToken, whose job is to hold the token that gets put in the nsl-jwt header. It has two constructors: one that takes a string (the header content), and one that takes a Shiro subject. The job of the one that takes the Shiro subject is to build the token for a given user. Maybe this should be a builder method.

In any case, I have a security realm named JsonTokenRealm, which knows how to interpret JsonToken credentials objects. This realm is a Spring bean, obviously, which holds an injected reference to the LDAP security realm bean. It implements authenticate, but everything else is deferred to LdapRealm. (In principle – I see it’s commented out at present.)

So that’s yer back-end.

Front end

For the front end, my Angular app makes a call to signInJson. This takes a username and password, and returns a JSON packet containing the JWT (and potentially an MOTD, etc). This JWT needs to be included in subsequent calls, in the nsl-jwt header.

To do this, you call the Angular module’s config method, passing it a function with an injectable $httpProvider parameter. This function pushes an interceptor factory onto $httpProvider.interceptors; the factory returns a map of functions that respond to the various HTTP request lifecycle events – in this case, request. That function stuffs the JWT into the nsl-jwt header. At this stage, I am just using ‘p’, because I will figure out later how the injected request function is supposed to get at the application scope. (p is a user name in the LDAP server running on my DEV box.)


app.config(['$httpProvider', function($httpProvider) {
    // Push an interceptor factory; its 'request' hook adds the JWT header.
    $httpProvider.interceptors.push(function() {
      return {
        'request': function(config) {
          config = config || {};
          config.headers = config.headers || {};
          config.headers['nsl-jwt'] = 'p';
          return config;
        }
      };
    });
}]);
And there’s yer front-end.

Not so fast

Oh, did you think that was it? Of course it isn’t. Nothing works.

The client app identifies objects entirely by their ‘mapper’ id. The job of the mapper is to convert these abstract URIs into concrete URLs. Its job is to know which servers (‘shards’) own which URIs, and to generate a 303 redirect.

Now, we have already addressed the host cross-origin issue. There was a Grails plug-in for it, which doesn’t seem to work. So I added a hard-coded

response.addHeader('Access-Control-Allow-Origin', '*')

to the fun bit. But this isn’t enough when you are also sending along suspicious HTTP headers. Oh noes! So I added a

response.addHeader('Access-Control-Allow-Headers', 'nsl-jwt')

Then Firefox started to whine about a ‘preflight check’. Turns out that when your JSON request has weird stuff in it, the CORS spec makes the browser send an OPTIONS request first. The mapper doesn’t respond to these (why would it? After all, what kind of weenie sends obscure HTTP requests?). So I wrote some stuff.
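The “stuff” boils down to: answer the OPTIONS probe with the access-control headers and an empty 200, and let everything else fall through to the normal handler. A sketch of the logic – in JavaScript rather than the mapper’s actual Grails code, with the header values mirroring the ones added above:

```javascript
// Sketch of preflight handling. The CORS header names and the nsl-jwt
// value are the real ones from above; the handler shape is illustrative,
// not the mapper's actual code.
function corsHeaders() {
  return {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Headers': 'nsl-jwt',
    'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
  };
}

function handlePreflight(method) {
  if (method === 'OPTIONS') {
    // Preflight probe: answer with the access-control gear and no body.
    return { status: 200, headers: corsHeaders(), body: '' };
  }
  return null; // not a preflight; let the mapper do its usual thing
}
```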

Aaaaand … it doesn’t work. Firefox sends the OPTIONS, gets back the access control gear, sends the GET, gets back the 303, then stops. The Angular http provider doesn’t follow redirects when there’s weird complicated CORS (it works fine if I don’t include that nsl-jwt header). Maybe there’s something I’m missing.

Three possible fixes

Manually handle redirect

Rig up the ‘refresh me’ code to do its own redirect loop

Distinguish between mapper and other ids

I think this is the simplest fix. We only need the JWT when we are doing an edit, and the edit URLs all have a specific structure. The request hook just needs to look at the URL being requested to decide whether the JWT needs to be there.
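As a sketch, assuming the edit URLs contain something like /edit/ (the real pattern, whatever the edit-URL structure actually is, would go in the regex):

```javascript
// Sketch: only attach the nsl-jwt header when the request is an edit.
// The /edit/ pattern is a placeholder for the real edit-URL structure.
var EDIT_URL = /\/edit\//;

function attachJwtIfNeeded(config, jwt) {
  config = config || {};
  config.headers = config.headers || {};
  if (EDIT_URL.test(config.url || '')) {
    config.headers['nsl-jwt'] = jwt;
  }
  return config;
}
```

Plain mapper GETs then stay “simple” requests in CORS terms, so the 303 dance keeps working for them.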

Read the goddamn CORS spec

Sigh. And this is probably the right way to do it. Spec is at https://www.w3.org/TR/cors. As I am old-school, I will need a printer and my reading glasses.


Well, it looks like CORS just plain doesn’t do redirects when life becomes complex. Section 7.1.4 (simple request) has explicit handling of redirects. Section 7.1.5 (cross-origin with preflight) states that, in either the preflight step or the actual request step, if the response code is anything outside the 2XX range, the request fails.

I suppose it’s an application’s job to know what servers it’s talking to and which of them need credentials, so option 2 (if it’s an edit, then include the JWT) becomes reasonable.

I’ll roll back my mods to the mapper, rip out the code on the Angular $http.request hook, and have the controllers explicitly include the JWT for those JSON requests that are edits.

Putting out fires

December 22, 2015

For anyone that cares, this is what our crashes look like.

TOP 20 url times are:
39432223* Safari/601.3.9, search/names?&search=true&keyword=S%25%25%25
39371188* Safari/601.3.9, search/names?&search=true&keyword=S%25%25%25
39310161* Safari/601.3.9, search/names?&search=true&keyword=S%25%25%25
39300736* Safari/601.3.9, search/names?&search=true&keyword=S%25
39041255* Safari/537.36, taxa/LYCAENIDAE/complete
38945840* Safari/537.36, taxa/LYCAENIDAE/complete
38930099* Applebot, taxa/Gymnothorax_pictus/checklist
6006648* Safari/537.36, taxa/LYCAENIDAE/complete
5975658* Safari/537.36, taxa/LYCAENIDAE/complete
5914978* Safari/537.36, taxa/LYCAENIDAE/complete
5914961* Baiduspider/2.0, taxa/Talaurinus_prypnoides/checklist
5863731* Baiduspider/2.0, taxa/Platyzosteria_jungi/checklist
5838122* Baiduspider/2.0, taxa/Pterohelaeus_litigiosus/checklist
5643696 Applebot/0.1, taxa/Eulecanium/checklist
5582189 Applebot, taxa/0c1c84ef-4ad1-403f-a7f7-f45447a0372a
5527142 Yahoo! Slurp, taxa/908440b7-da2f-4840-a81f-b5b3b5a2c14e
5487277 Baiduspider, taxa/Ropalidia_plebeiana/statistics
5430266 Yahoo! Slurp, taxa/Eumelea%20duponchelii
5025511* Baiduspider/2.0, taxa/Pseudostrongyluris_polychrus/statistics
4967732 Firefox/24.0, taxa/Siganus/checklist

java.lang.OutOfMemoryError: GC overhead limit exceeded 

The numbers on the left are the request duration in milliseconds. An asterisk indicates that the request is still ongoing. As you can see, these requests are taking 10 hours to return. Obviously, I need some sort of watchdog to interrupt threads.

The entries are sorted in order of duration, so the oldest requests are at the top. This tells the story. Someone, whose IP address I am not repeating here, searched for ‘S%%%’. Then, going “hmm, it’s not coming back”, they hit the search button twice more, got rid of the extra percentage signs, and just searched for ‘S%’. And then, I suppose, concluded that AFD “doesn’t work” and went away.

To fix this, there’s now a validation rule on the basic name search path through the app: if you use a wildcard, you must also include at least three non-wildcard characters. It’s a stronger version of the previous validation rule, which checked only for searches that were nothing but wildcards.
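The rule itself is simple enough to sketch (assuming ‘%’ and ‘_’ are the SQL-style wildcards the search accepts; the real check lives in the app’s validation layer):

```javascript
// Sketch of the search validation rule: a keyword containing a wildcard
// must also contain at least three non-wildcard characters.
// Assumes '%' and '_' are the wildcard characters the search accepts.
function searchAllowed(keyword) {
  const nonWildcard = keyword.replace(/[%_]/g, '');
  const hasWildcard = nonWildcard.length !== keyword.length;
  return !hasWildcard || nonWildcard.length >= 3;
}
```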

The obvious question is, “why didn’t you think about this in the first place?” The answer is that we chose to make it permissive because we have legit users who really do want all the names in AVES and should be able to get them. Now I’m putting out fires, adding limits of various kinds on a case-by-case basis. It’s bitsy. Far from ideal.

What to do, what to do.

  • Google “adding a watchdog timer to a webapp”. The difficulty is that web applications are not supposed to start their own threads. Tomcat will probably let me do it, but it shouldn’t. Need to find the proper way to go about this.
  • Why are those requests for checklists taking so long? Oh – that’s right. I preload the javascript objects with all sibling items all the way up to the root. Perhaps I should root the checklist at whichever taxon the user requests a checklist for, and provide breadcrumbs allowing them to navigate up. It would be faster, and also cleaner and better-looking.

Decision time

December 16, 2015

So, it was decision time today. What shall our publicly-published identifiers look like? For real, this time.

At work, we expose quite a bit of data to the web, especially to “The Semantic Web” (henceforth: ‘semweb’). The basic notion is that the semweb as a whole is a single, vast, extended triple-store with an object model on top of it. Identifiers identify atomic things, and these things have attributes with (possibly multivalued) values. The values may either be literals, or other objects.

The identifiers in this system are URIs. URIs are opaque and don’t mean anything. However, in the semantic web of linked data, there is a convention: identifiers should be http URLs, and you should be able to perform an HTTP GET on them. If you do this, some service will spit back a 303 redirect to a URL which will give you – well – something meaningful about the URI. In practice, you get an HTML page or an RDF (or even JSON) document, depending on the HTTP headers.

In our system, the identified things have a three-part compound key.

They have a namespace. This namespace corresponds to a ‘shard’ of our system. A shard is both a collection of objects that someone is responsible for, and an instance of our services running over the top of some particular database instance. Currently, we have an APNI (Australian Plant Name Index) shard and an AusMoss shard.

They have an object type. We expose names (that is: scientific and other names of things biological), references, instances of names at a reference, and classifications of instances (ie: instances that someone has organised into a tree structure).

And, of course, they have an id. As it happens, the id is unique within the namespace.

(Physically, these three parts are schema, tablename, and rownum. We have a single id sequence for the schema.)

Up till now, we have been exposing URIs that look like this:


When one of these URIs comes into the site, it is sent to the linked-data “mapper”, whose job is to reply to this URI with a URL that will get to our service engine. For instance, the mapper could (it doesn’t, but it could) respond with a redirect to:


The mapper’s job, in part, is to “know” that if the namespace is ‘apni’, then that is served up by the server sitting in cluster 5. This may change, and that’s ok. What matters is that the URI doesn’t.

Problem is, we don’t like our URIs. The ‘boa’ part means nothing. We could remove it, but this plays havoc with the reverse proxy: it means that every object type needs to be put into the Oracle reverse proxy config, and that we can’t manage our URL namespace freely when it comes to deploying services. And there were a variety of other problems that were hashed out over the afternoon.

Eventually, the solution we chose was this:


This has a variety of advantages.

Although URIs are in theory opaque, of course they are not in practice. People who look at a URL would like to have some notion of what they are going to get back before they click the link. These URIs do that. You can read it. It’s an id. Of a name. Belonging to apni.

We can proxy all things that might be linked data URIs through to our linked data service at the firewall without having to worry any further about it. Our semantic web portal has its own little world in which it can do whatever it likes, and that’s fine.

We can rearrange our web pages and deploy URL prefix spaces at biodiversity.org.au freely, without the possibility that we might need to reorganise everything should we – for instance – decide to publish identifiers for specimens. To put it another way: the biodiversity.org.au domain was doing two jobs. It was a namespace for identifiers, and it was also a working website with static pages, webapps, and so on. Now it only does the one job.

Because it has its own domain name, the linked data service can easily be detached and moved into the cloud. The service is very simple, but works with a huge amount of data. Hosting it specially on an entirely different IP and site may in future make a lot of sense.

And finally, this is a decent general solution. If you run somefrogs.org and want to expose linked data, id.somefrogs.org makes perfect sense. If everyone used an ‘id.X’ subdomain for linked data identifiers minted by organisation X, I think that would be swell.

In fact, with a bit of fiddling about you could point your id.somefrogs.org subdomain at our mapper in the cloud, and we could host your ids. Because it’s a subdomain, your website would not need any changes. We’d agree on a namespace id for your stuff, you’d explain how we convert your object type/id pairs into URLs, and off we go.

Anyway. It’s what we are doing.

AFD Leaking, Part III

December 15, 2015

Well, the fix is in.

  • Robots are prohibited from viewing the bulk pages: names, host taxa, bibliography.
  • Robots get a truncated checklist.
  • Robots get a publication page without lists of references or sub-publications.

All of these pages are aggregations of information that is already in the main profile pages, and that’s what we want robots to be indexing.

Robots are detected via their User-Agent headers. I snarfed a list of known robots from useragentstring.com.
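The check is nothing fancy: a case-insensitive substring match of the User-Agent against that list. A sketch, with a token sample standing in for the full useragentstring.com list:

```javascript
// Sketch of the robot check: case-insensitive substring match of the
// User-Agent header against known crawler tokens. The real list was
// snarfed from useragentstring.com; this is a tiny sample of it.
var ROBOT_TOKENS = ['bingbot', 'googlebot', 'baiduspider', 'applebot', 'slurp'];

function isRobot(userAgent) {
  var ua = (userAgent || '').toLowerCase();
  return ROBOT_TOKENS.some(function (tok) { return ua.indexOf(tok) !== -1; });
}
```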

My logs show the same number of hits from spiders, but the URIs taking the most time to respond are now mostly Firefox clients.

Mozilla/5.0 (Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6

Will this fix things? By God, it better. Still a few worrying bits. A nine-second request for


with a UA of

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)

which my filter is supposed to be catching.

Time will tell. If I revisit this, I will look at implementing support for If-Modified-Since, which is an outstanding feature request anyway.

AFD Leaking, Part II

December 14, 2015

So I had Sally run AFD, then I ran 23 wget processes to slurp AFD from localhost. What I found was that the thing was jamming when wget asked for – for instance – “all host taxa for Nematoda”, or “The bibliography for all of Mammalia”. So I instituted some savage limits on those links and redeployed it to prod with some additional logging to report the user agent for calls asking for big downloads.

And what do you know. Every one of them was a spider. All of ’em. Mostly bing.com, and a few others I don’t recognise.

Way forward: keep certain user agents away from those pages in particular. It’s ok for them to fetch taxa/AVES, but not taxa/AVES/bibliography or taxa/AVES/complete . Regrettably, robots.txt does not support globbing, so we can’t put the rule in there. We can jam it in the reverse proxy, or in the AFD app itself.

This done, we can relax the download limits so that legitimate users can get their names lists.

AFD Leaking

December 14, 2015

Goddammit, the AFD public pages are leaking memory.

Only in production, of course. We have a cron job to reboot it every day at 11PM. It’s not enough – only lasts 10-12 hours before blowing up again.

I tightened up the memory scavenging and sharing in the servlet that fetches and caches maps from the geoserver. Really thought this would fix it. Didn’t work – AFD continued blowing up.

I added a logging filter to spit out, at five-second intervals, what pages were being fetched and how long they were taking. I found that usually, there were about 8-12 sessions max at any particular moment, but just before a blow-up the number of concurrent sessions would shoot up.

So I added a filter that would count the number of active requests and throw a 503 if there were more than 20. Didn’t work – AFD continued blowing up.

I got my logging filter to spit out stack traces. Found that there were suspicious things relating to Oracle connections. So I tweaked the Oracle connection properties. Removed the max session age limit, because the AFD public app is read-only (the age limit was 10 hours, which was suspiciously similar to the interval between blow-ups). Added a property to cache prepared statements.

After this, the performance of the pages dramatically improved. Maybe I’m just fooling myself – I didn’t actually take timings. But it feels a hell of a lot snappier. I imagine that caching the prepared statements means that JDBC no longer asks Oracle “what columns does this return and what are their types?” when it runs a stored procedure it already knows about. Big, big savings.

But – AFD continued blowing up.

So what the hell do I do now?

Flush Hibernate

Maybe Hibernate is caching things – not completing the transaction, leaving things lying around. A filter to flush Hibernate once the page is done might fix things.

Flush Memory

Add something to force a gc every now and then?

Revisit image caching

I use the usual pattern to cache images – a WeakHashMap of soft references to objects holding the hashmap key and the cached object. While developing this, I added some logging and determined that yes – the references were getting cleared. But when I attached phantom references, I found that the phantom references never got enqueued. Maybe I misunderstood phantom references. Maybe there really is a problem – perhaps a manual gc is needed.

A potential fix would be to supplement the soft reference behaviour with something that manually drops images from the cache.

Fix bulk CSV

Browsing through some of the old code, I found something very, very nasty. Some of the bulk CSV pages operate by collecting the whole page into a StringBuffer and then spitting it out. This is a pending bug – maybe it’s time to address it.

Enable JVM debugging in PROD

Maybe I’d get a picture of what, exactly, is clogging up memory if I could get onto the JVM with a debugger.

Duplicate the problem in DEV, and debug it there

You know, this is what I ought to do, rather than throwing things at PROD. DEV is my desktop machine, Sally. How to duplicate the problem, though? Well – how about running a dozen copies of wget, pointing them at localhost, and maybe cutting tomcat memory right down.

Way forward

Ok. I will attempt to have Sally duplicate the behaviour. Point debugger at local tomcat and find out what java classes are hogging the memory.

Thanks for the advice.