AFD Leaking, Part III


Well, the fix is in.

  • Robots are prohibited from viewing the bulk pages: names, host taxa, bibliography.
  • Robots get a truncated checklist.
  • Robots get a publication page without lists of references or sub-publications.

All of these pages are aggregations of information that is already in the main profile pages, and that’s what we want robots to be indexing.

Robots are detected via their User-Agent headers. I snarfed a list of known robots from useragentstring.com.
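A minimal sketch of this kind of User-Agent check, in Python. The token list and the name `is_robot` are made up for illustration; the real filter uses the much longer list taken from useragentstring.com, and may match differently.

```python
# Hypothetical, abbreviated list of robot tokens; the real list is far longer.
ROBOT_TOKENS = (
    "googlebot",
    "bingbot",
    "applebot",
    "slurp",
    "baiduspider",
    "yandexbot",
)


def is_robot(user_agent):
    """Return True if the User-Agent header contains a known robot token.

    Matching is a case-insensitive substring test, so a token is found
    even when it sits inside an otherwise browser-like UA string.
    """
    if not user_agent:
        # A missing or empty header is treated as "not a known robot".
        return False
    ua = user_agent.lower()
    return any(token in ua for token in ROBOT_TOKENS)
```

A substring test is used rather than an exact match because many crawlers embed their name inside a UA that otherwise looks like an ordinary browser.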

My logs show the same number of hits from spiders, but the URIs taking the most time to respond are now mostly requested by Firefox clients, such as:

Mozilla/5.0 (Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6

Will this fix things? By God, it had better. There are still a few worrying bits, such as a nine-second request for

http://biodiversity.org.au/afd/taxa/Avocettina/bibliography

With a UA of

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)

Which my filter is supposed to be catching.

Time will tell. If I revisit this, I will look at handling If-Modified-Since requests, which is an outstanding feature request anyway.
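For reference, a rough sketch of what an If-Modified-Since check could look like, in Python. The function name `not_modified` is invented here, and it assumes the application can cheaply supply a timezone-aware last-modified timestamp for each page, which may not be true of AFD.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def not_modified(if_modified_since, last_modified):
    """Return True if the server could answer 304 Not Modified.

    if_modified_since: raw header value from the request, or None.
    last_modified: timezone-aware datetime of the page's last change
                   (assumed to be available from the application).
    """
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: ignore it and serve the full page.
        return False
    if since.tzinfo is None:
        # Treat a date parsed without an offset as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    return last_modified.replace(microsecond=0) <= since
```

When this returns True, the server can send an empty 304 response instead of rebuilding the page, which is where the saving on the expensive aggregate pages would come from.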
