Goddammit, the AFD public pages are leaking memory.
Only in production, of course. We have a cron job to reboot it every day at 11PM. That's not enough – the app only lasts 10-12 hours before blowing up again.
I tightened up the memory scavenging and sharing in the servlet that fetches and caches maps from the geoserver. Really thought this would fix it. Didn’t work – AFD continued blowing up.
I added a logging filter to spit out, at five-second intervals, what pages were being fetched and how long they were taking. It showed that there were usually 8-12 concurrent sessions at most, but just before a blow-up the number of concurrent sessions would shoot up.
So I added a filter that would count the number of active requests and throw a 503 if there were more than 20. Didn’t work – AFD continued blowing up.
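The core of that load-shedding filter is just an atomic counter bracketing each request. A minimal sketch of the counting logic (class name and limit are mine, not the actual AFD code – in the real filter, a false return from tryEnter() would translate to sendError(503), and exit() would sit in a finally block around chain.doFilter()):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Admits at most `limit` concurrent requests; excess requests get shed. */
public class RequestGate {
    private final AtomicInteger active = new AtomicInteger(0);
    private final int limit;

    public RequestGate(int limit) {
        this.limit = limit;
    }

    /** Returns true if the request may proceed; the caller must call exit() when done. */
    public boolean tryEnter() {
        while (true) {
            int n = active.get();
            if (n >= limit) return false;              // over the cap: caller sends a 503
            if (active.compareAndSet(n, n + 1)) return true;
        }
    }

    public void exit() {
        active.decrementAndGet();
    }

    public int activeCount() {
        return active.get();
    }
}
```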
I got my logging filter to spit out stack traces. There were suspicious things relating to Oracle connections, so I tweaked the Oracle connection properties. I removed the max session age limit, because the AFD public app is read-only (the age limit was 10 hours, which was suspiciously similar to the interval between blow-ups), and added a property to cache prepared statements.
After this, the performance of the pages dramatically improved. Maybe I'm just fooling myself – I didn't actually take timings – but it feels a hell of a lot snappier. I imagine that caching the prepared statements means that JDBC no longer asks Oracle "what columns does this return and what are their types?" when it runs a stored procedure it already knows about. Big, big savings.
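For reference, with the Oracle thin driver the statement cache is normally switched on through the implicit-statement-cache connection property; something along these lines (the size of 50 is illustrative, not what AFD actually uses):

```properties
# Oracle thin-driver connection property: enables implicit statement caching
# with room for 50 prepared/callable statements per connection
oracle.jdbc.implicitStatementCacheSize=50
```

A nonzero size both enables the cache and bounds it; statements are cached per connection, so the saving compounds with a connection pool.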
But – AFD continued blowing up.
So what the hell do I do now?
Maybe Hibernate is caching things – not completing the transaction, leaving things lying around. A filter to flush Hibernate once the page is done might fix things.
Add something to force a gc every now and then?
Revisit image caching
I use the usual pattern to cache images – a WeakHashMap of soft references to objects holding the hashmap key and the cached object. While developing this, I added some logging and determined that yes – the references were getting cleared. But when I attached phantom references, I found that the phantom references never got enqueued. Maybe I misunderstood phantom references. Maybe there really is a problem – perhaps a manual gc is needed.
A potential fix would be to supplement the soft reference behaviour with something that manually drops images from the cache.
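That supplement might look roughly like this (a simplified sketch with names of my own – the real code keys a WeakHashMap; here a plain HashMap of SoftReferences keeps the example short):

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Image cache whose entries the GC may reclaim under memory pressure,
 *  supplemented with a manual eviction pass instead of trusting the GC. */
public class ImageCache<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<K, SoftReference<V>>();

    public synchronized void put(K key, V image) {
        cache.put(key, new SoftReference<V>(image));
    }

    public synchronized V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V v = (ref == null) ? null : ref.get();
        if (ref != null && v == null) cache.remove(key);   // reference was cleared
        return v;
    }

    /** Manual supplement: drop cleared entries and, if the cache is still
     *  too big, empty it outright rather than waiting on the GC. */
    public synchronized void prune(int maxEntries) {
        for (Iterator<Map.Entry<K, SoftReference<V>>> it = cache.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue().get() == null) it.remove();
        }
        if (cache.size() > maxEntries) cache.clear();
    }

    public synchronized int size() {
        return cache.size();
    }
}
```

Calling prune() from the end of each request (or a timer) takes the GC out of the critical path entirely.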
Fix bulk CSV
Browsing through some of the old code, I found something very, very nasty: some of the bulk CSV pages operate by collecting the entire page into a StringBuffer and then spitting it out. This is a pending bug – maybe it's time to address it.
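The fix would be to stream rows straight to the response writer instead of accumulating the lot. A rough sketch of the idea (method names and the quoting rules are mine, not the AFD code):

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

/** Writes CSV rows directly to the (response) writer, so the whole
 *  dump never sits in memory at once. */
public class CsvStreamer {
    public static void writeRow(Writer out, List<String> fields) throws IOException {
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) out.write(',');
            out.write(quote(fields.get(i)));
        }
        out.write("\r\n");
        out.flush();   // push each row out rather than buffering the lot
    }

    /** Quote a field only when it contains a comma, quote, or newline. */
    private static String quote(String s) {
        if (s == null) return "";
        if (s.indexOf(',') < 0 && s.indexOf('"') < 0 && s.indexOf('\n') < 0) return s;
        return '"' + s.replace("\"", "\"\"") + '"';
    }
}
```

Memory use then stays flat no matter how many rows the query returns, instead of growing with the size of the dump.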
Enable JVM debugging in PROD
Maybe I’d get a picture of what, exactly, is clogging up memory if I could get onto the JVM with a debugger.
Duplicate the problem in DEV, and debug it there
You know, this is what I ought to do, rather than throwing things at PROD. DEV is my desktop machine, Sally. How to duplicate the problem, though? Well – how about running a dozen copies of wget, pointing them at localhost, and maybe cutting Tomcat memory right down.
Ok. I will attempt to have Sally duplicate the behaviour, point a debugger at the local Tomcat, and find out which Java classes are hogging the memory.
Thanks for the advice.