AFD Leaking

Goddammit, the AFD public pages are leaking memory.

Only in production, of course. We have a cron job to restart it every day at 11PM. It’s not enough – it only lasts 10-12 hours before blowing up again.

I tightened up the memory scavenging and sharing in the servlet that fetches and caches maps from the geoserver. Really thought this would fix it. Didn’t work – AFD continued blowing up.

I added a logging filter to spit out, at five-second intervals, what pages were being fetched and how long they were taking. I found that usually, there were about 8-12 sessions max at any particular moment, but just before a blow-up the number of concurrent sessions would shoot up.

So I added a filter that would count the number of active requests and throw a 503 if there were more than 20. Didn’t work – AFD continued blowing up.
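For the record, the counting part of that filter looks roughly like this. It's a minimal sketch rather than the code that is actually deployed: the class name `RequestLimiter` and its methods are made up for illustration, and the servlet plumbing is left out so the counter stands on its own.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: counts in-flight requests and refuses new ones above a ceiling. */
final class RequestLimiter {
    private final int maxActive;
    private final AtomicInteger active = new AtomicInteger();

    RequestLimiter(int maxActive) {
        this.maxActive = maxActive;
    }

    /** Returns true if the request may proceed; the caller must then call exit() when done. */
    boolean tryEnter() {
        if (active.incrementAndGet() > maxActive) {
            active.decrementAndGet(); // over the limit: undo the increment and refuse
            return false;
        }
        return true;
    }

    /** Marks one admitted request as finished. */
    void exit() {
        active.decrementAndGet();
    }

    /** Number of requests currently admitted and not yet finished. */
    int activeCount() {
        return active.get();
    }
}
```

In a servlet filter, the idea would be to call `tryEnter()` at the top of `doFilter`, send `HttpServletResponse.SC_SERVICE_UNAVAILABLE` if it returns false, and otherwise call `chain.doFilter(...)` inside a `try` with `exit()` in the `finally`, so that a request that throws still releases its slot.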

I got my logging filter to spit out stack traces. Found that there were suspicious things relating to Oracle connections. So I tweaked the Oracle connection properties. Removed the max session age limit, because the AFD public app is read-only (the age limit was 10 hours, which was suspiciously similar to the interval between blow-ups). Added a setting to cache prepared statements.

After this, the performance of the pages dramatically improved. Maybe I’m just fooling myself – I didn’t actually take timings. But it feels a hell of a lot snappier. I imagine that caching the prepared statements means that JDBC no longer asks Oracle “what columns does this return and what are their types?” when it runs a stored procedure it already knows about. Big, big savings.

But – AFD continued blowing up.

So what the hell do I do now?

Flush Hibernate

Maybe Hibernate is caching things – not completing the transaction, leaving things lying around in the session. A filter to flush Hibernate once the page is done might fix things.

Flush Memory

Add something to force a gc every now and then?

Revisit image caching

I use the usual pattern to cache images – a WeakHashMap of soft references to objects holding the hashmap key and the cached object. While developing this, I added some logging and determined that yes – the references were getting cleared. But when I attached phantom references, I found that the phantom references never got enqueued. Maybe I misunderstood phantom references. Maybe there really is a problem – perhaps a manual gc is needed.

A potential fix would be to supplement the soft reference behaviour with something that manually drops images from the cache.
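One way to do that, as a rough sketch: keep the soft references, but put a hard ceiling on the number of entries so the least recently used image is dropped whether or not the garbage collector ever gets around to it. The class name `BoundedSoftCache` and the entry limit are invented for illustration; this is not the cache that is currently in AFD.

```java
import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache: soft references plus a hard cap on the number of entries. */
final class BoundedSoftCache<K, V> {
    private final Map<K, SoftReference<V>> map;

    BoundedSoftCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry at the head of the map.
        this.map = new LinkedHashMap<K, SoftReference<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                return size() > maxEntries; // manual eviction, independent of the gc
            }
        };
    }

    synchronized V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // the gc cleared the referent; drop the stale entry
        }
        return value;
    }

    synchronized void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    synchronized int size() {
        return map.size();
    }
}
```

The cap bounds how many images can be held at once; the soft references still let the JVM reclaim them earlier under memory pressure.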

Fix bulk CSV

Browsing through some of the old code, I found something very, very nasty. Some of the bulk CSV pages operate by collecting the entire page into a StringBuffer and then spitting it out, so the whole export sits in memory for every request that asks for it. This is a pending bug – maybe it’s time to address it.
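The obvious fix is to write each row to the response as it is produced instead of building the whole page first. A minimal sketch, assuming the rows can be iterated one at a time; `CsvStreamer` and its methods are hypothetical names, not the existing AFD code:

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical helper: streams CSV rows straight to a Writer, one row at a time. */
final class CsvStreamer {
    /** Writes every row to out without holding more than one row in memory. */
    static void writeRows(Iterable<List<String>> rows, Writer out) throws IOException {
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    out.write(',');
                }
                out.write(escape(row.get(i)));
            }
            out.write("\r\n");
        }
        out.flush();
    }

    /** Quotes a field if it contains a comma, quote or line break; doubles embedded quotes. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.indexOf(',') >= 0 || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0 || field.indexOf('\r') >= 0;
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }
}
```

In the servlet, the `Writer` would be the one from `response.getWriter()`, and the rows would come from walking the query results rather than from a list built up front. Memory use then depends on the size of one row, not the size of the export.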

Enable JVM debugging in PROD

Maybe I’d get a picture of what, exactly, is clogging up memory if I could get onto the JVM with a debugger.
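Short of attaching a debugger, the logging filter could at least report which memory pool is filling up, using the standard `java.lang.management` beans. This is only a sketch (the class name `MemoryReport` is made up), and it won't say which classes are responsible, only whether the trouble is in the heap or somewhere else.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: one line per JVM memory pool, suitable for periodic logging. */
final class MemoryReport {
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            sb.append(pool.getName())
              .append(" (").append(pool.getType()).append("): used=")
              .append(usage.getUsed() / 1024).append("K committed=")
              .append(usage.getCommitted() / 1024).append("K max=")
              // getMax() returns -1 when the pool has no defined maximum
              .append(usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + "K")
              .append('\n');
        }
        return sb.toString();
    }
}
```

Logging `MemoryReport.snapshot()` alongside the five-second page log would show which pool climbs in the run-up to a blow-up.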

Duplicate the problem in DEV, and debug it there

You know, this is what I ought to do, rather than throwing things at PROD. DEV is my desktop machine, Sally. How to duplicate the problem, though? Well – how about running a dozen copies of wget, pointing them at localhost, and maybe cutting Tomcat’s memory right down.

Way forward

Ok. I will attempt to have Sally duplicate the behaviour, then point a debugger at the local Tomcat and find out what Java classes are hogging the memory.

Thanks for the advice.


