Bulk loading into eXist XML database


Well, I finally managed to get data into eXist more-or-less reliably.

A little background

eXist is an XML database. It presents itself as a WebDAV filesystem containing XML documents. The documents in it look like files in directories, but they are stored as DOM nodes. The important thing is that any directory (“collection”, in xmldb-speak) may have a config file associated with it telling eXist that certain XML elements or attributes are to be indexed.
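That config file is a document called collection.xconf, stored under /db/system/config in a path mirroring the target collection. Something along these lines (my reconstruction, not the actual file) tells eXist to build a range index on @lsid:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <!-- range index on the lsid attribute; this is what makes @lsid lookups fast -->
            <create qname="@lsid" type="xs:string"/>
        </index>
    </collection>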

The key thing is how this interacts with XPath. XPath is an expression language: given a set of XML nodes as a context, a path expression returns a different set of XML nodes. In an XML database, XPath may be executed in the context of a collection. When this is done, the initial set of nodes is the document nodes of all documents in the collection and its subcollections.

The data that I put into eXist is a set of several hundred documents, each containing a few thousand taxon name, taxon concept, and publication elements. Each of these elements has an “lsid” attribute. Inside eXist, I evaluate this expression:

collection("/db/Documents")//*[@lsid=$FOO]

to get all XML elements in the “Documents” directory/collection that have an LSID of FOO. The fact that these elements are scattered across several hundred files doesn’t matter. More importantly: because I have put an index on @lsid, eXist finds the elements pretty much straight away. The result is a reasonably quick lookup – quick enough to back pages like this one:

http://biodiversity.org.au/apni.taxon/54321.html
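While I’m at it, a feel for the shape of these documents – the element names and LSID here are illustrative, not the real schema:

    <documents>
        <!-- a few thousand of these per file, each carrying an lsid attribute -->
        <taxonName lsid="urn:lsid:biodiversity.org.au:apni.taxon:54321">...</taxonName>
        <taxonConcept lsid="...">...</taxonConcept>
        <publication lsid="...">...</publication>
    </documents>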

Loading the data

I divide the data into several smaller files – the code that extracts the data from the database as XML starts a new file every 16 meg (uncompressed). To get a file into eXist, you do an HTTP PUT to the appropriate spot.
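A single upload, in curl terms, is just a PUT of the file to its target URI (the file name here is a placeholder; the script’s real invocation appears further down):

    curl --user "$USER:$PASS" --upload-file chunk-001.xml \
        "http://localhost:8080/db/Documents/chunk-001.xml"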

However, it’s all a little fragile. I have tried using several threads, I have tried doing one file at a time, but eventually eXist simply jams. If you kill -9 it, eXist screws up its data files. The only thing that works is to bring up eXist, load in a few files, and then kill it (a plain kill, which sends a SIGTERM and lets it shut down cleanly).

And so I wrote a script to do this. But even then, it would trip over itself. I tried sleeping, I tried looking for the lock file – eventually I created a hybrid method that seems to work.
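The script takes the files to load as its command-line arguments, so a run looks something like this (the script name is made up for illustration):

    ./bulk-load.sh extract-*.xml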

The initialisation routine sets some variables:

EXISTDIR – the installation directory of the eXist binaries.
DATADIR – the eXist data directory. I don’t use webapp/WEB-INF/data – the data is in /var, and there’s a system of symlinks so that I can swap out bad data sets quickly.
LOCKFILE=${DATADIR}/dbx_dir.lck – the eXist lock file.
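Concretely, the initialisation amounts to something like this (the paths here are placeholders, not my real layout):

    EXISTDIR=/opt/exist                  # where the eXist binaries live
    DATADIR=/var/exist/data              # real data in /var, behind symlinks
    LOCKFILE=${DATADIR}/dbx_dir.lck      # eXist's lock file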

Ok. The outer loop is:

while [ $# -ne 0 ]
do
    unset FF
    # gather the next 16 file names into FF
    for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    do
        [ $# -eq 0 ] && break
        FF=$FF' '$1
        shift
    done
    up $FF
done

I originally had

    FF="$1 $2 $3 $4 $5 $6 $7 $8 $9 $10"

However, it turns out that $10 is parsed as ${1}0. Drat. The loop above does 16 files at a time. I am not sure how it would handle file names with spaces in them – but that’s not something I have to worry about. Bash does do arrays, but I don’t need them for this; an array version is sketched below for completeness.
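For the record, here is roughly what the array version would look like – it sidesteps both the $10 parsing pitfall and the spaces-in-names problem (a sketch, not code from the actual script):

    BATCH=("${@:1:16}")                # first 16 positional parameters, quoting intact
    shift $(( $# < 16 ? $# : 16 ))     # consume at most 16 arguments
    up "${BATCH[@]}"                   # pass each file name as a separate argument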

The inner loop (function up) does this:

First, a little bulletproofing. If eXist is already running, then I was doing something with it other than a simple data upload, and that should be dealt with before proceeding.

    # refuse to run if the data directory is locked by a live eXist
    if [[ -f $LOCKFILE ]]
    then
        for i in $@
        do
            echo "Data directory is locked. Not uploading $i." | tee -a upload-log.$$.txt
        done
        exit
    fi

Launch the server and make a note of its PID. Note: this is the PID of the server.sh wrapper script, not of the server process itself.

    $EXISTDIR/bin/server.sh > /dev/null &
    EXISTBATPID=$!

Wait for the server to well and truly get started. First, I tried waiting for the lockfile.

    while [[ ! -f $LOCKFILE ]]
    do
        echo "waiting for lockfile"
        sleep 2
    done

But even then the thing would trip over itself. Finally, I worked out that the issue was the HTTP component – it starts last of all, and must be waited for. So I wait until I can get a directory listing from it.

    
    # poll until eXist's HTTP interface answers; curl fails while the port is closed
    until curl http://localhost:8080/db/Documents > /dev/null 2>&1
    do
        echo "waiting for http"
        sleep 2
    done

Ok. Everything is going. Now I find the process id of the eXist process itself and remember it. Using the Solaris ptree command, I search for a java process that is a child of the script I launched. (Come to think of it, I could just ptree $$ and ditch $EXISTBATPID altogether – see the sketch after the code.)

    # take the PID column of the java child in the wrapper's process tree
    EXISTPID=$(ptree $EXISTBATPID | grep java | sed 's/ *//' | sed 's/ .*//')
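The ptree $$ variant mentioned above would be the same pipeline rooted at the script’s own PID (I haven’t actually run it this way):

    EXISTPID=$(ptree $$ | grep java | sed 's/ *//' | sed 's/ .*//')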

Now we loop and upload each file. There’s some code to unzip the file which I won’t show here. After the unzip, I wait for eXist to settle down – for its percentage of CPU to drop below 20%.

    # spin until eXist's CPU usage drops below 20%
    while true
    do
        PCPU=$(ps -o pcpu -p $EXISTPID | grep -v CPU | sed 's/ *//g' | sed 's/\..*$//')
        PCPU=${PCPU:-0}    # guard: if the process has vanished, don't loop forever
        if (( $PCPU <= 20 ))
        then
            break
        fi
        sleep 1
    done

Next, we do the actual curl. We are looking for a 201 status code in the HTTP response. eXist will complain if the XML is invalid, and I want to capture that information. Hence the tee prior to grepping for the 201.

    # PIPE, COLLECTION, NAME, and BASE come from the unzip code not shown here
    curl --user "$USER:$PASS" -i --upload-file ${PIPE}.xml "http://localhost:8080/db/Documents/${COLLECTION}/${NAME}.xml" 2>&1 \
        | tee upload-file.$$.txt \
        | grep 'HTTP/1.1 201 Created' > /dev/null

    if [ $? -ne 0 ]
    then
        echo "upload failed."
        echo "UPLOAD $BASE FAILED" | tee -a upload-log.$$.txt
        cat upload-file.$$.txt >> upload-log.$$.txt
    fi

Finally, we shut down eXist with a plain kill (SIGTERM) and loop until it is well and truly shut down.

    # nudge eXist until it releases the lock file...
    while [[ -f $LOCKFILE ]]
    do
        kill $EXISTPID
        sleep 1
    done

    # ...and until the process itself is gone
    while ps -p $EXISTPID > /dev/null
    do
        kill $EXISTPID
        sleep 1
    done

And that does the job, safely loading hundreds of large files into eXist in batches of 16 by launching and shutting down eXist fully each time.

I wind up with a 19GB data directory which can be scp-ed across to prod.
