Oct 13, 2012

Last.fm open source commons libraries

We recently open-sourced a few of our common libraries at Last.fm. This was primarily because we had a number of larger projects that we wanted to make public, such as Moji - our Java MogileFS client. These projects have dependencies on convenience classes in our common libraries, so rather than duplicate those classes we decided to make the effort to open source the libraries themselves.

These libraries contain a few classes that I find particularly useful when writing tests:


This library contains a nice system clock abstraction named (unsurprisingly) 'Clock', which was inspired by this StackOverflow answer. A clock instance gives you access to various forms of the current time. So far, millis, nanos, java.util.Date, java.util.Calendar, and org.joda.time.DateTime instances are supported. The class may seem too simple to be of any value - after all, it just delegates to some standard Java API methods. However, the real power of the abstraction becomes apparent when unit testing time dependent code.
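
To make this concrete, here is a minimal sketch of such an abstraction. Note that the interface and method names below are illustrative guesses, not the actual lastfm-commons API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a swappable clock abstraction (names illustrative).
interface Clock {
    long millis();
}

// Production implementation: just delegates to the system clock.
class SystemClock implements Clock {
    public long millis() { return System.currentTimeMillis(); }
}

// Test implementation: the 'current time' only moves when the test says so.
class StubClock implements Clock {
    private final AtomicLong now = new AtomicLong();
    public long millis() { return now.get(); }
    public void advanceMillis(long delta) { now.addAndGet(delta); }
}

public class ClockSketch {
    public static void main(String[] args) {
        StubClock clock = new StubClock();
        long start = clock.millis();
        clock.advanceMillis(60_000L);               // 'one minute' passes instantly
        System.out.println(clock.millis() - start); // prints 60000
    }
}
```

Production code depends only on Clock; tests construct the class under test with a StubClock and move time forward explicitly.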

In your unit tests you now have the ability to swap out the real system clock instance with a mock or a stub and take control of the 'current time' - at least as far as the class under test is concerned. This is very useful, as time dependent tests that use the system clock directly often suffer from the following problems:

Long running tests
You may have some code that should change some state once a minute has elapsed. If using the system clock directly you might call Thread.sleep() to allow the system clock to advance:

  Thread.sleep(TimeUnit.MINUTES.toMillis(1));
  assertThat(myClass.getState(), is(/* changed state */));

This will work, but will also result in a test that takes no less than one minute to execute. With more tests such delays are compounded and we might end up with a test suite that takes considerable time to execute. We want tests to be fast so that they are unobtrusive and easy to run. With a mockable clock we could instead write the following and be free from long delays:

  when(clock.millis()).thenReturn( // stubbing the clock; method name illustrative
    0L, TimeUnit.MINUTES.toMillis(1));
  assertThat(myClass.getState(), is(/* changed state */));

Delays in tests are brittle
You might have another test on a class that should only change state if more than 5 seconds but less than 10 seconds have passed. The problem here is that Thread.sleep() does not guarantee the exact duration of the delay. If your machine is busy, the system clock might easily tick past 5 seconds before the thread on which your test is running is rescheduled, and your assertion will fail. I have seen this frequently with tests running on continuous-integration machines - tests may run fine for months, but if the CI machine is particularly busy, brittle time dependent tests will suddenly start failing. As shown in the previous example, a clock abstraction can remove the need for such delays.

The future
You may wish to test the behaviour of a class at future moments in time and, unless you have a time machine, this just isn't practical with the system clock alone. However, you can set a mock or stub clock to whatever time you wish.


We really like using JUnit's TemporaryFolder rule for cleanly managing temporary folders and files in our tests. It automatically handles the tear down of the temp files and removes the need for unsightly path strings in the test code, which in the worst case often end up in a shared constant somewhere. We applied a similar approach to the handling of data files for tests using a DataFolder abstraction. Assuming our standard Maven project layout, implementations allow simple, clean, and scoped access to test data. The data folders also allow us to organise test data by test class or even test method without having to worry about the underlying file path. Some examples:

Top level data folder

  public class MyTest {

    @Rule
    public DataFolder dataFolder = new RootDataFolder();

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data
    }
  }

Top level data folder with children

  public class MyMp3Test {

    @Rule
    public DataFolder dataFolder
      = new RootDataFolder("mp3", "128K", "clean");

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data/mp3/128K/clean
    }
  }

Per-class data folder

  public class MyTest {

    @Rule
    public DataFolder dataFolder = new ClassDataFolder();

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data/fm/last/project/MyTest
    }
  }

Per-method data folder

  public class MyTest {

    @Rule
    public DataFolder dataFolder = new MethodDataFolder();

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data/fm/last/project/MyTest/myTestMethod
    }
  }

Wrap up

Hopefully you will find these libraries, or the concepts presented here, useful in your own tests.

Oct 12, 2012

Garbage Collector tuning for services with large heaps

I've been working on a Thrift based Java service that uses a large heap. Most of the heap is occupied by a fairly static 7GB dataset that is reloaded every 24 hours. Real-time updates to the data occur throughout the day, but they are low in both volume and size, and are incorporated into the next day's 'full' dataset. The rest of the heap is made up of very short-lived objects needed to represent and process client requests to the service. Ideally the service should have a response latency in the order of milliseconds, and it met this requirement easily on its initial deployment using just the Concurrent-Mark-Sweep garbage collector.

More recently the traffic has increased by an order of magnitude. Under these conditions the default garbage collector configuration couldn't cope, and 'stop the world' pauses lasting many seconds were killing performance. We analysed the garbage collector's operation with the following JVM options:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

We found that the garbage collector was collecting across the large old generation. However, in the expected usage scenario the old generation should ideally contain only the long lived static dataset, perhaps with the real-time data updates that are delivered throughout the day - nothing should need to be collected from it. The GC logs showed that many more objects were frequently tenured into the old generation, only to be garbage collected later. So where were all of these objects coming from? The most likely source was the groups of 'short-lived' objects created per request - if these were being promoted into the old generation then it would grow quickly, requiring frequent long collections. So how can you prevent this from happening?
  1. Increase the maximum tenuring age to increase the window of opportunity for young objects to be collected before they reach the old generation.
  2. Increase the size of the new generation to lengthen the interval between new space collection cycles, which in turn makes it less likely that young objects will be promoted to the old generation before they die.
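
In HotSpot terms, these two adjustments map onto JVM flags along the following lines (the values are illustrative starting points to be tuned, not prescriptions):

```
-XX:MaxTenuringThreshold=15
-Xmn3g
```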

Increasing the maximum tenure threshold

On each young generation GC cycle, the garbage collector promotes surviving objects from the eden space into a survivor space. There are two survivor spaces and CMS alternates between them on each cycle. There is also a tenure counter on each object in the survivor space that records the number of GC cycles the object has survived there. Every time the survivor space is collected, all objects that remain in the space have their counter incremented. Once an object's counter reaches the maximum tenure age, it is promoted to the old generation.
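
The promotion mechanics can be sketched as a toy simulation - purely illustrative, since the real collector moves memory regions rather than iterating lists of objects:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of survivor-space aging: one object survives its first eden
// collection, then has its tenure counter bumped on every subsequent
// young GC until it reaches the threshold and is promoted.
public class TenuringDemo {
    static final int MAX_TENURING_AGE = 4; // the default mentioned below

    public static void main(String[] args) {
        List<Integer> survivorAges = new ArrayList<>();
        List<Integer> oldGen = new ArrayList<>();
        survivorAges.add(0); // one object enters a survivor space at age 0

        for (int gc = 0; gc < MAX_TENURING_AGE; gc++) {
            List<Integer> next = new ArrayList<>();
            for (int age : survivorAges) {
                int newAge = age + 1;
                if (newAge >= MAX_TENURING_AGE) {
                    oldGen.add(newAge); // counter hit the threshold: promoted
                } else {
                    next.add(newAge);   // still young: stays in survivor space
                }
            }
            survivorAges = next;
        }
        System.out.println(oldGen.size()); // prints 1 - the object was promoted
    }
}
```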

By keeping objects in the smaller survivor spaces for longer, we ensure that objects must stick around for a significant amount of time before being promoted to the old generation. In this time we can hope that any truly short-lived objects will be collected long before they hit the maximum tenure count. The default maximum tenure threshold is 4, but it can be configured as high as 15 like so:

  -XX:MaxTenuringThreshold=15

Increasing the size of the new generation

Ideally we want all short-lived objects to be collected early on, from the new generation. However, if the new space is small then there will be many frequent small collections. Under high object throughput, eden space will fill up quickly - and if requests are still in progress, many objects will enter the survivor spaces because they are still 'in use'. As the collections are frequent, the tenure counters on the objects in the survivor spaces will increment quickly. Should these objects still be 'in use' by the time they reach the maximum tenure age, they'll enter the old generation - something that we want to avoid. To slow this process down we can increase the size of the new space like so:

  -Xmn3g

When increasing the new space we were actually more concerned with increasing the eden space. If the time it takes to fill eden (and thus trigger a collection) is significantly greater than the average request duration, then we can hope that the majority of newly created objects will be 'dead' by the time a collection starts, and hence immediately eligible for collection from eden. Thanks to the increase in tenuring threshold, those that do survive eden will then have to make it through 15 more GC cycles in the survivor spaces before they are promoted to the old generation. An arbitrary new generation size of 3GB was chosen as a good starting point.
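
As a rough back-of-the-envelope check on the eden sizing (the allocation rate below is an assumed figure for illustration, not a measurement from our service):

```java
// If eden is ~3GB and the service allocates ~100MB/s, eden takes tens of
// seconds to fill - orders of magnitude longer than a millisecond-scale
// request, so almost all request garbage dies before a young GC runs.
public class EdenFillTime {
    public static void main(String[] args) {
        long edenBytes = 3L * 1024 * 1024 * 1024;  // ~3GB new space
        long allocRatePerSec = 100L * 1024 * 1024; // assumed 100MB/s allocation
        System.out.println(edenBytes / allocRatePerSec); // seconds between young GCs
    }
}
```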

Monitoring the garbage collector

With these settings in place we wanted to get a good grasp of how they affected the garbage collection profile of our application. Firstly, we enabled some extra GC logging to provide insight into the tenuring of objects:

  -XX:+PrintTenuringDistribution

We also started an instance of VisualVM on the same machine and installed the VisualGC plugin. This plugin provides excellent insight into what's going on inside the garbage collector in real-time and can really aid understanding - especially when compared with the rather obtuse GC log formats.

VisualGC plugin for VisualVM
The VisualGC view above shows a snapshot of the GC as the service is load tested. The 7GB dataset is the big green block labelled 'old'. The new collections (the dark brown saw-tooth pattern labelled 'eden') are regular, as the load is fairly constant and thus so is the object throughput. It actually takes quite a while for eden to fill up, so we could likely decrease the new size from 3GB without any ill effect. The fact that the survivor spaces (S0, S1, and Histogram) are so empty shows that nearly all per-request objects are disposed of from eden and never make it to the survivor spaces. The 'histogram' section shows that the objects that do make it into a survivor space fall away rapidly with each cycle, so that very little is tenured into the old generation. This is great because the old space is now effectively static, and the GC has no cause to perform a large collection on it. However, we were curious to see which objects were making it into the survivor spaces, so we turned to the tenuring distributions in the GC logs (sample edited for brevity):

  2293.760: [GC 2293.760: [ParNew
  Desired survivor size 161054720 bytes, new threshold 15 (max 15)
  - age   1:    8344528 bytes,    8344528 total
  - age   2:    2035920 bytes,   10380448 total
  - age   3:    4351424 bytes,   14731872 total
  - age  15:    1165424 bytes,   ...

What this log entry shows is that around 1MB of data is making it through to the maximum tenuring age. We certainly expected some objects to occasionally reach this point - such as those representing updates to the dataset - however, looking across multiple log entries it was evident that the rate of object promotion to the old space was consistent - perhaps something was leaking? It would be extremely convenient if there were some way of asking the GC 'which objects do you have in tenure bucket 15?' because then we'd know exactly what was being promoted - unfortunately there is no way of doing so.

Luckily we did have a 'good guess'. The service makes use of a monitoring component that uses data buckets to aggregate application metrics over time - each bucket typically holds a minute's worth of data and is valid for up to ten minutes. It seemed likely that these buckets would be alive long enough to be promoted to the old generation and - in the worst case - fill up the old space. We decided to rewrite this component to use fixed-length queues that are allocated once on construction. When testing the service with this change the GC logs changed dramatically:
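
The fix can be sketched roughly as follows - this is our own illustrative reconstruction, not the actual Last.fm monitoring code. A fixed-length ring of counters is allocated once up front, so recording a metric never allocates:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch of a 'fixed-length queue' of per-minute metric
// buckets, allocated once on construction and reused forever.
class MetricRing {
    private final AtomicLongArray buckets;
    private final int size;

    MetricRing(int minutes) {
        this.size = minutes;
        this.buckets = new AtomicLongArray(minutes); // the only allocation
    }

    // NOTE: a real version must also zero a bucket when its minute rolls
    // over; that detail is omitted here for brevity.
    void record(long epochMinute, long value) {
        buckets.addAndGet((int) (epochMinute % size), value);
    }

    long read(long epochMinute) {
        return buckets.get((int) (epochMinute % size));
    }
}

public class MetricRingDemo {
    public static void main(String[] args) {
        MetricRing ring = new MetricRing(10);
        ring.record(42, 5);
        ring.record(42, 7);
        System.out.println(ring.read(42)); // prints 12
    }
}
```

Because the buckets live for the lifetime of the component, they are promoted to the old generation exactly once and then stay put - no steady drip of short-lived bucket objects reaching tenure age 15.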

  5755.168: [GC 5755.168: [ParNew
  Desired survivor size 161054720 bytes, new threshold 15 (max 15)
  - age   1:    9638328 bytes,    9638328 total
  - age   2:      79056 bytes,    9717384 total
  - age   3:         24 bytes,    9717408 total
  - age   4:         56 bytes,    9717464 total
  - age  14:       1024 bytes,    9724088 total
  - age  15:         32 bytes,    ...

This log shows an extremely low movement of objects from the new space to the old.

So, we were able to use VisualGC and the GC debug logs to gain insight into the garbage collection process. This helped us tune our application so that long pauses were effectively eradicated and latency targets could be reliably met. Of course, a full GC will happen eventually, but our architecture conveniently allows for this to be on our terms (when running under normal conditions). Different architectures may require alternative garbage collection strategies, but I recommend profiling with VisualGC early on and then using the GC logs to provide greater detail.

I must offer my thanks to my colleague James for pairing up with me on this problem!

Oct 8, 2012

Setting ulimits for jsvc

One of my Thrift based Java services was recently rolled out to a new set of clients, resulting in a 2000% increase in requests. This was all planned for and the transition was 'fairly' smooth. However, I noticed that the service would periodically run out of file descriptors, resulting in the following exception:

  java.io.IOException: Too many open files

Now, when we planned the deployment of the service we had considered the number of concurrent connections and set the open file limit for the user under which the service would run, so the presence of this error was quite confusing. I initially suspected a file descriptor leak, so I started graphing the number of files the process had open, using a simple script to capture the data - note that as I'm using jsvc, it's easy to identify the process ID:

  PID=`cat /var/run/service-name/service-name.pid`
  echo `date` $PID `ls -la /proc/$PID/fd | wc -l`

Sure enough, I found a leak, but it was tiny and wouldn't account for the resource usage I was seeing. I identified the source of the leak by correlating it with events on other monitoring graphs and fixed it anyway. But the graph showed something far more interesting - the file descriptors would consistently max out at ~1000, a value far below our configured limit. I then remembered that jsvc is actually started by the root user and then drops down to the service user - perhaps it was using the open file limit for root. It looked suspiciously so:

  root@server:/# ulimit -n
  1024

This was exactly the same behaviour I had seen when configuring large pages - which I had unfortunately forgotten! Resource limits for jsvc based processes must be set for the user that starts jsvc - not the user it downgrades to. This is a known issue that has been raised with Apache. As a quick workaround I set the limit in my init.d script like so:

  ulimit -n 15000

However, this feels a bit hidden, and I worry that the approach would not work well when running multiple services, each with its own particular resource limits.

Fortunately I don't have to worry about that right now but it would be more convenient to assign a user per service and then have user based limits. Hopefully I'll now remember this jsvc behaviour so I don't get caught out by it a third time.

Oct 6, 2012

Investigating unexpected GC pauses

I've recently had a lot of fun learning about GC profiling. I have a service with a fairly large heap (14GB). The majority of the heap comprises an effectively static dataset that is served to clients. The application had been performing very well using the Concurrent-Mark-Sweep collector. However, I was observing something quite strange on our dashboards - all client requests would time out over a very short period every hour, and this occurred on every single node in the cluster. After much investigation of cron jobs, networking, etc. I decided to look at what the garbage collector was doing. I had initially discounted the GC because I reasoned that CMS could not be responsible for such periodic behaviour, given that pressure on memory was low and the load on the service varied throughout the day. I enabled GC logging like so:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Unexpectedly, when the next 'timeout' period occurred I saw:

    2012-10-05T17:20:06.498+0000: [Full GC (System) ... ]

So, the system was doing a full 'stop the world' garbage collection cycle. Now, although CMS can and will run a full GC occasionally, the periodic nature of the collections didn't quite fit - I suspected something was calling System.gc() every hour. To investigate, I first hit Ctrl-Alt-H on the method in Eclipse to find any invocations on the classpath - this didn't yield anything obvious. I then hooked the service up to a debugger, set a breakpoint on System.gc() and waited. I was hoping to find that some thread running in one of my 3rd party libraries was abusing the garbage collector, but when the breakpoint hit I was instead directed to:

  sun.misc.GC$Daemon.run()

This is the default garbage collector thread and hence looked like a bit of a dead end. I had a look at the code and found that the GC interval is controlled by a field private static long latencyTarget - suspiciously this had a runtime value of 3600000ms (1 hour). So I had found the cause of the System.gc() invocations but still did not know what was configuring the sun.misc.GC$Daemon to operate every hour. As an initial workaround I used the following JVM option to override the direct GC invocations:

  -XX:+DisableExplicitGC

However, this didn't make me that happy, as part of our application benefits from an explicit System.gc() call. Once a day the service stops accepting clients while a new 7GB dataset is loaded into memory. The process of loading the data can create a lot of garbage, so it is nice to be able to run a full GC before the service starts accepting clients again. The -XX:+DisableExplicitGC option disables this call too, so any garbage is going to sit around and will have to be collected when we start serving clients again - not ideal.

Thankfully my colleague James found that some RMI components that we use in our JMX management component interact with sun.misc.GC. The sun.rmi.transport.ObjectTable class suspiciously requests a garbage collection latency of 1 hour by invoking the method sun.misc.GC.requestLatency. Further examination of the class (lines 38, 274) showed that this interval is configurable with the following system properties:

  -Dsun.rmi.dgc.server.gcInterval=9223372036854775807
  -Dsun.rmi.dgc.client.gcInterval=9223372036854775807

By setting these properties to sun.misc.GC.NO_TARGET (== Long.MAX_VALUE == 9223372036854775807L), the periodic call to System.gc() is avoided. With this arcane property value I no longer need -XX:+DisableExplicitGC and am free of the hourly application freeze. So the lesson here is to carefully consider the garbage collection side effects of any JDK or 3rd party libraries that you are using.

Jun 28, 2012

Configuring the Sun JVM to use huge pages on Debian

I have a Java service that currently runs with a 20GB heap. I was keen to try out the -XX:+UseLargePages option to see how it might affect the performance of the system. I configured the Debian OS as described by Oracle, using appropriate shared memory and page values (these can also be calculated with a handy online tool).

However, I had great difficulty starting the JVM with the (sparsely) documented settings. I could see that the OS had allocated the expected amount of memory as huge pages. However, starting the VM with the -XX:+UseLargePages option set always resulted in one of the following errors:

When -Xmx was almost equal to the huge page allocation:
  • Failed to reserve shared memory (errno = 28). // 'No space left on device'

When -Xmx was less than the huge page allocation:
  • Failed to reserve shared memory (errno = 12). // 'Out of memory'

A helpful Stack Overflow answer suggested that I apply another kernel parameter that is seemingly not documented in any of the standard -XX:+UseLargePages resources:
  • /proc/sys/kernel/shmall
This parameter must be set to the equivalent of /proc/sys/kernel/shmmax, but expressed in 4KB page units; so for a shmmax of 25769803776 bytes, shmall should be 6291456. This solved some of my issues and I could test the correct instantiation of the JVM with:
  • java -XX:+UseLargePages -Xmx20g -version
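
The shmall arithmetic is easy to get wrong, so it is worth sanity-checking the conversion from bytes to 4KB pages:

```java
// Sanity check: shmall = shmmax (bytes) / page size (bytes).
public class ShmallCalc {
    public static void main(String[] args) {
        long shmmaxBytes = 25_769_803_776L; // the 24GB shmmax from above
        long pageBytes = 4096;              // shmall is counted in 4KB pages
        System.out.println(shmmaxBytes / pageBytes); // prints 6291456
    }
}
```
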
However, my application is executed by jsvc from Apache commons-daemon, and when starting it this way the memory allocation problem remained. I hadn't expected this, as the executing user had the appropriate hugetlb group membership and adequate memlock limits declared in /etc/security/limits.conf:
  • appuser soft memlock 25165824
  • appuser hard memlock 25165824
Now, although jsvc is normally started by the root user, it switches to a configured exec user. Some experimentation showed that appuser's memlock limits were not applied when jsvc switched from root to appuser, and hence memory could not be allocated to the JVM. To get around this I also added root to the hugetlb group and provided additional memlock limits for the root user:
  • root soft memlock 25165824
  • root hard memlock 25165824
Now everything works as expected.