AFD Leaking, Part II


So I had Sally run AFD, then I ran 23 wget processes to slurp AFD from localhost. What I found was that the thing was jamming when wget asked for, say, “all host taxa for Nematoda” or “the bibliography for all of Mammalia”. So I instituted some savage limits on those links and redeployed it to prod, with some additional logging to report the user agent on calls asking for big downloads.

And what do you know. Every one of them was a spider. All of ’em. Mostly bing.com, and a few others I don’t recognise.

Way forward: keep certain user agents away from those pages in particular. It’s ok for them to fetch taxa/AVES, but not taxa/AVES/bibliography or taxa/AVES/complete. Regrettably, the original robots.txt convention does not support globbing (some crawlers honour wildcard extensions, but we can’t count on all of them doing so), so we can’t rely on a rule in there. We can jam it in the reverse proxy, or in the AFD app itself.
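For the "in the app itself" option, the check could look something like the sketch below. This is not AFD's actual code. The path pattern, the list of user-agent substrings, and the names `HEAVY_PATH`, `CRAWLER_TOKENS`, `is_crawler` and `should_block` are all assumptions made up for illustration; the real list of agents would come from whatever the new logging turns up.

```python
import re

# Hypothetical pattern for the expensive pages: taxa/<NAME>/bibliography
# and taxa/<NAME>/complete. Plain taxa/<NAME> is deliberately not matched.
HEAVY_PATH = re.compile(r"^/?taxa/[^/]+/(bibliography|complete)/?$", re.IGNORECASE)

# Assumed substrings that mark a user agent as a crawler. Illustrative only.
CRAWLER_TOKENS = ("bingbot", "googlebot", "spider", "crawler", "slurp")


def is_crawler(user_agent):
    """Return True if the User-Agent header contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def should_block(path, user_agent):
    """Block crawlers from the heavy pages only; everyone else passes."""
    path_only = path.split("?", 1)[0]  # ignore any query string
    return bool(HEAVY_PATH.match(path_only)) and is_crawler(user_agent)
```

A request handler would call `should_block(path, user_agent)` before doing any work and answer with a 403 (or similar) when it returns True. The same path-plus-agent test could just as well live in the reverse proxy config instead.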

This done, we can relax the download limits so that legitimate users can get their names lists.
