So I had Sally run AFD, then I ran 23 wget processes to slurp AFD from localhost. What I found was that the thing was jamming when wget asked for – for instance – “all host taxa for Nematoda”, or “The bibliography for all of Mammalia”. So I instituted some savage limits on those links and redeployed it to prod with some additional logging to report the user agent for calls asking for big downloads.
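For the record, the load test amounts to hammering a handful of expensive endpoints from a pile of concurrent fetchers. Here is a rough Python sketch of the same idea; the paths and worker count are stand-ins, and the stub server is only there so the sketch runs without a real AFD instance.

```python
# Sketch of the load test: N concurrent fetchers pulling "expensive"
# AFD-style pages. The paths below are illustrative, not AFD's real routes.
import threading
import urllib.request
import urllib.error
from collections import Counter
from http.server import HTTPServer, BaseHTTPRequestHandler

def hammer(base_url, paths, workers=23):
    """Fetch each path from `workers` concurrent threads; return status counts."""
    counts = Counter()
    lock = threading.Lock()

    def fetch():
        for path in paths:
            try:
                with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                    status = resp.status
            except urllib.error.HTTPError as e:
                status = e.code
            with lock:
                counts[status] += 1

    threads = [threading.Thread(target=fetch) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts

# Throwaway local server standing in for AFD on localhost.
class Stub(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), Stub)
threading.Thread(target=server.serve_forever, daemon=True).start()
counts = hammer(f"http://127.0.0.1:{server.server_port}",
                ["/taxa/NEMATODA/hosts", "/taxa/MAMMALIA/bibliography"],
                workers=4)
server.shutdown()
```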
And what do you know. Every one of them was a spider. All of ’em. Mostly bing.com, and a few others I don’t recognise.
Way forward: keep certain user agents away from those pages in particular. It’s ok for them to fetch taxa/AVES, but not taxa/AVES/bibliography or taxa/AVES/complete. Regrettably, robots.txt doesn’t support globbing (a few crawlers honour `*` as a nonstandard extension, but we can’t rely on it), so we can’t put the rule in there. We can jam it in the reverse proxy, or in the AFD app itself.
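If we go the in-app route, the rule is small enough to live in a WSGI middleware. A hedged sketch, assuming Python on the app side; the user-agent list and the URL pattern here are my guesses, not AFD’s actual routes:

```python
# Hypothetical middleware: crawlers may fetch a taxon page, but get a 403
# on the expensive sub-pages (bibliography, complete download).
import re

# Assumed shapes; adjust to the app's real routes and observed user agents.
HEAVY_PATH = re.compile(r"^/taxa/[^/]+/(bibliography|complete)$")
CRAWLER_UA = re.compile(r"bingbot|googlebot|crawler|spider", re.IGNORECASE)

def block_crawlers(app):
    """Wrap a WSGI app so known crawlers can't hit the heavy endpoints."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        ua = environ.get("HTTP_USER_AGENT", "")
        if HEAVY_PATH.match(path) and CRAWLER_UA.search(ua):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Bulk downloads are not available to crawlers.\n"]
        return app(environ, start_response)
    return middleware
```

The same pattern drops into a reverse proxy just as easily (e.g. an nginx `location` regex guarded by a user-agent check); the middleware version just keeps the policy next to the app it protects.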
This done, we can relax the download limits so that legitimate users can get their names lists.