Oct 6, 2012

Investigating unexpected GC pauses

I've recently had a lot of fun learning about GC profiling. I have a service that has a fairly large heap (14G). The majority of the heap comprises an effectively static dataset that is served to clients. The application has been performing very well using the Concurrent-Mark-Sweep collector. However, I was observing something quite strange on our dashboards - all client requests would time out over a very short period every hour - and this occurred on every single node in the cluster. After much investigation of cron jobs, networking, etc. I decided to see what the garbage collector was doing. I had initially discounted the GC, reasoning that CMS could not be responsible for such periodic behaviour given that pressure on memory was low and the load on the service varied throughout the day. I enabled GC logging like so:
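Something along these lines (the log path is illustrative):

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/myservice/gc.log
```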


Unexpectedly, when the next 'timeout' period occurred I saw:

    2012-10-05T17:20:06.498+0000: [Full GC (System) ... ]

So, the system was doing a full 'stop the world' garbage collection cycle. Now although CMS can and will run a full GC occasionally, the periodic nature of the collections didn't quite fit - I suspected something was calling System.gc() every hour. To investigate this I first hit Ctrl-Alt-H on the method in Eclipse to find any invocations on the classpath - this didn't yield anything obvious. I then hooked up the service to a debugger, set a breakpoint on System.gc() and waited. Now, I was hoping to find that some thread running in one of my 3rd party libraries was abusing the garbage collector, but when the breakpoint hit I was instead directed to:
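A suspended JDK daemon thread, with the top frame (roughly) being:

```
Daemon Thread [GC Daemon]
    sun.misc.GC$Daemon.run()
```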


This is the JDK's own GC daemon thread and hence looked like a bit of a dead end. I had a look at the code and found that the GC interval is controlled by a field private static long latencyTarget - suspiciously this had a runtime value of 3600000ms (1 hour). So I had found the cause of the System.gc() invocations but still did not know what was configuring the sun.misc.GC$Daemon to operate every hour. As an initial workaround I used the following JVM option to suppress the explicit GC invocations:
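```
-XX:+DisableExplicitGC
```

This turns calls to System.gc() into no-ops rather than removing them.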


However, this didn't make me that happy, as part of our application benefits from an explicit System.gc() call. Once a day the service stops accepting clients and a new 7GB dataset is loaded into memory. The process of loading the data can create a lot of garbage, so it is nice to be able to run a full GC before the service starts accepting clients again. The -XX:+DisableExplicitGC option disables this call too, so any garbage is going to sit around and has to be collected while we are serving clients again - not ideal.
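A minimal sketch of that daily refresh cycle (the class and method names here are hypothetical stand-ins, not our actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the refresh described above: stop serving, rebuild
// the large in-memory dataset, request a full GC to clear the loading
// garbage, then resume serving.
public class DatasetRefresher {
    private volatile boolean acceptingClients = true;
    private volatile List<byte[]> dataset = new ArrayList<>();

    public void refresh() {
        acceptingClients = false;     // refuse clients during the reload
        dataset = loadDataset();      // loading creates a lot of transient garbage
        System.gc();                  // a no-op if -XX:+DisableExplicitGC is set
        acceptingClients = true;      // start serving the new dataset
    }

    public boolean isAcceptingClients() {
        return acceptingClients;
    }

    private List<byte[]> loadDataset() {
        // placeholder for building the ~7GB static dataset
        List<byte[]> data = new ArrayList<>();
        data.add(new byte[1024]);
        return data;
    }
}
```

With -XX:+DisableExplicitGC in force the System.gc() line silently does nothing, which is exactly the problem described above.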

Thankfully my colleague James found that some RMI components used by our JMX management layer interact with sun.misc.GC. The sun.rmi.transport.ObjectTable class suspiciously requests a garbage collection latency of 1 hour by invoking the method sun.misc.GC.requestLatency. Further examination of the class (lines 38, 274) showed that this interval is configurable with the following system properties:
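```
-Dsun.rmi.dgc.server.gcInterval=9223372036854775807
-Dsun.rmi.dgc.client.gcInterval=9223372036854775807
```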


By setting these properties to sun.misc.GC.NO_TARGET (== Long.MAX_VALUE == 9223372036854775807L) the periodic call to System.gc() is avoided. With this arcane property value I no longer need -XX:+DisableExplicitGC and am free of the hourly application freeze. So the lesson here is to carefully consider the garbage collection side effects of any JDK or 3rd party libraries that you are using.

1 comment:

  1. To make it clear: you can add -Dsun.rmi.dgc.server.gcInterval=9223372036854775807 to your Java command line to stop RMI from causing an hourly GC.

    For the insight it gives, here is the fragment from our start script.

    # Believe it or not, this is ridiculously fucking important - RMI (used by
    # our JMX) basically configures an hourly full GC if this is not set.
    JAVA_OPTS="$JAVA_OPTS -Dsun.rmi.dgc.server.gcInterval=$JAVA_LONG_MAX_VALUE"
    JAVA_OPTS="$JAVA_OPTS -Dsun.rmi.dgc.client.gcInterval=$JAVA_LONG_MAX_VALUE"