AFD Leaking, Part II

So I had Sally run AFD, then I ran 23 wget processes to slurp AFD from localhost. What I found was that the thing was jamming when wget asked for – for instance – “all host taxa for Nematoda”, or “The bibliography for all of Mammalia”. So I instituted some savage limits on those links and redeployed it to prod with some additional logging to report the user agent for calls asking for big downloads.

And what do you know. Every one of them was a spider. All of ’em. Mostly, and a few others I don’t recognise.

Way forward: keep certain user agents away from those pages in particular. It’s ok for them to fetch taxa/AVES, but not taxa/AVES/bibliography or taxa/AVES/complete. Regrettably, the original robots.txt convention does not support globbing, and not every crawler honours the wildcard extensions some of them have added, so we can’t rely on a rule in there. We can jam it in the reverse proxy, or in the AFD app itself.
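Here is a rough sketch of what that check might look like if it lived in the app. It is not the actual AFD code. The path pattern, the list of user-agent substrings and the function names are all assumptions made up for illustration; the real list would come from whatever the new logging turns up.

```python
import re

# Assumed URL layout: the heavy pages are taxa/<NAME>/bibliography and
# taxa/<NAME>/complete, while plain taxa/<NAME> stays open to everyone.
HEAVY_PATH = re.compile(r"^/?taxa/[^/]+/(bibliography|complete)/?$", re.IGNORECASE)

# Hypothetical substrings that mark a user agent as a spider.
SPIDER_HINTS = ("bot", "spider", "crawl", "slurp")


def is_spider(user_agent):
    """Guess whether a User-Agent header belongs to a crawler."""
    ua = (user_agent or "").lower()
    return any(hint in ua for hint in SPIDER_HINTS)


def should_block(path, user_agent):
    """True when a spider asks for one of the big-download pages."""
    path_only = path.split("?", 1)[0]  # ignore any query string
    return bool(HEAVY_PATH.match(path_only)) and is_spider(user_agent)
```

A request handler or proxy hook could call should_block(path, user_agent) and return a 403 when it comes back true. Matching on substrings is crude and will miss spiders that lie about who they are, so the download limits are still worth keeping as a backstop.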

This done, we can relax the download limits so that legitimate users can get their names lists.