Embedding semantic web statements in PDF documents

PDF documents contain a “Document Metadata” section that permits a subset of RDF to be embedded in the document. The subject of each triple must be ‘this PDF document’, which is written as an empty string (rdf:about=""). The value of a property, however, may be a nested anonymous object.

Using the semantic web predicate owl:sameAs, the limitation that the RDF can only talk about the PDF document it is embedded in can be circumvented at the semantic layer. While the RDF graph in a PDF document cannot directly talk about things that are not the document, OWL allows us to do so by implication.

Even without adding additional metadata to a PDF document, exposing the metadata to the semantic web via the document DOI might be a reasonable thing to do.

To include RDF predicates in a PDF document using Adobe Acrobat Pro, go to File > Properties and navigate through to “Additional Metadata”.

From here, predicates in an RDF/XML file can be added. Here’s a sample file:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c316 44.253921, Sun Oct 01 2006 17:14:39">
  <!-- The cv namespace URI below is a placeholder; substitute your own vocabulary. -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
    xmlns:cv="http://example.org/cv#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <cv:sometext>This is some text</cv:sometext>
      <cv:someProperty rdf:resource="http://example.org/specimen/4321"/>
      <dwc:originalNameUsage rdf:parseType="Resource">
        <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Taxon"/>
        <cv:simple-literal-name>Taxonus simplex</cv:simple-literal-name>
        <owl:sameAs rdf:resource="urn:lsid:example.org:taxon:123456"/>
      </dwc:originalNameUsage>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

We have found several constructs that Acrobat Pro will refuse to import. Any rdf:about other than “” will not be imported, and rdf:parseType="Literal" is not recognised.

However, the main thing here is that you can include anonymous objects using rdf:parseType="Resource". This allows you to do the “owl:sameAs” trick. Above, an anonymous object has rdf:type Taxon and a cv:simple-literal-name of “Taxonus simplex”. By declaring this anonymous object to be owl:sameAs urn:lsid:example.org:taxon:123456, a reasoner can infer that that taxon has a simple literal name of “Taxonus simplex”.
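To make the inference concrete, here is a minimal sketch in Python of what a reasoner does with the sample's triples. It is a deliberate simplification, not a real OWL reasoner: it only copies statements between subjects linked by owl:sameAs. The function name apply_same_as, the blank node label _:b0 and the cv namespace URI are all assumptions made for illustration.

```python
# Simplified owl:sameAs inference over a tiny in-memory set of triples.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Placeholder namespace for the cv: prefix (assumption, not a real vocabulary).
CV_NAME = "http://example.org/cv#simple-literal-name"
TAXON_LSID = "urn:lsid:example.org:taxon:123456"

# "_:b0" stands for the anonymous object created by rdf:parseType="Resource".
triples = {
    ("_:b0", RDF_TYPE, "http://rs.tdwg.org/dwc/terms/Taxon"),
    ("_:b0", CV_NAME, "Taxonus simplex"),
    ("_:b0", OWL_SAME_AS, TAXON_LSID),
}


def apply_same_as(triples):
    """Copy statements between sameAs-linked subjects until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in inferred if p == OWL_SAME_AS]
        for a, b in pairs:
            for s, p, o in list(inferred):
                if p == OWL_SAME_AS:
                    continue  # do not copy the sameAs links themselves
                for src, dst in ((a, b), (b, a)):
                    if s == src and (dst, p, o) not in inferred:
                        inferred.add((dst, p, o))
                        changed = True
    return inferred


inferred = apply_same_as(triples)
for triple in sorted(inferred):
    print(triple)
```

After the call, the set contains statements whose subject is the LSID itself, even though the embedded RDF never used the LSID as a subject.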

This raises the question: “great, what do you do with it?”

In keeping with the semantic web structure, it might be appropriate for the RDF embedded in the PDF to be returned when an HTTP fetch is made against the DOI with an Accept header of “application/rdf+xml”. This could be done either with a servlet that can parse PDF files, or by exporting the PDF to XML and using a stylesheet transformation to extract any rdf:RDF in the document.
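As a rough illustration of the server side, the following Python sketch pulls the rdf:RDF element out of a PDF's bytes and checks whether a client asked for RDF. It assumes the XMP packet is stored uncompressed and uses the conventional x:xmpmeta wrapper, so a simple byte scan can find it. A PDF with a compressed or encrypted metadata stream would need a real PDF parser. The function names extract_rdf and wants_rdf are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumes the packet uses the usual "x:" prefix and is not compressed.
XMP_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_rdf(pdf_bytes):
    """Return the first embedded rdf:RDF element as an XML string, or None."""
    match = XMP_PACKET.search(pdf_bytes)
    if match is None:
        return None
    root = ET.fromstring(match.group(0))
    rdf = root.find("{%s}RDF" % RDF_NS)
    if rdf is None:
        return None
    return ET.tostring(rdf, encoding="unicode")


def wants_rdf(accept_header):
    """Very rough content negotiation: ignores q-values and wildcards."""
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    return "application/rdf+xml" in media_types
```

A handler behind the DOI resolver could call wants_rdf on the request's Accept header and, when it returns True, respond with the output of extract_rdf instead of the PDF itself.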

Doing this means that the document metadata becomes part of the semantic web without much additional work. Even without adding additional metadata, the PDF metadata contains things such as creation time, document title and so on, all of which may be of interest to consumers of semantic data.
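For instance, the document title and creation time are normally stored under the standard dc:title and xmp:CreateDate properties. The Python sketch below reads both from an XMP packet that has already been extracted as a string. The function name basic_properties is hypothetical, and the code assumes the title uses the usual rdf:Alt language-alternative form.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}


def basic_properties(xmp_xml):
    """Return the title and creation date found in an XMP packet, if any."""
    root = ET.fromstring(xmp_xml)

    # dc:title is a language alternative: dc:title/rdf:Alt/rdf:li
    title = root.find(".//dc:title/rdf:Alt/rdf:li", NS)

    # xmp:CreateDate may be a child element or an attribute of rdf:Description.
    created = root.find(".//xmp:CreateDate", NS)
    if created is not None:
        created_text = created.text
    else:
        attr = "{%s}CreateDate" % NS["xmp"]
        created_text = next(
            (d.get(attr)
             for d in root.iter("{%s}Description" % NS["rdf"])
             if d.get(attr)),
            None,
        )

    return {
        "title": title.text if title is not None else None,
        "created": created_text,
    }
```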

