Jan 17, 2013

Distributed transactions with Spring

Warning: The application design I describe below is not one that I'd recommend. This post concerns slight improvements to a monster of a legacy project that has some technical debt. Specifically I'm not at all happy with the following:
  • It uses a mix of ORM technologies
  • It is multi-VM where it should be multi-threaded
  • The current design requires distributed transactions whereas a better design might not
End of warning.

I’ve been working on some improvements for a legacy data loading application. It basically takes in a bunch of XML in different formats and updates a couple of separate databases with the transformed data. One database maintains a history of the loaded data, the other a current snapshot. It is essential that the entities referenced in these databases are consistent with one another.

Now, given that there are two databases that are both being updated and that some keys in database A are referenced in database B, I was expecting to see some kind of distributed transaction control. However, some sample test cases showed that this was not the case - a rollback on one database would not initiate a rollback on the other - leaving the databases in an inconsistent state.

A simple solution would be to use a JTA/XA transaction manager to provide robust distributed transactions with two-phase commit. I’ve used XA in the distant past where the transaction management was provided by a J2EE container such as WebLogic or JBoss, and I seem to recall it being fairly straightforward - pretty much checking an ‘XA’ checkbox on each data source. However, the application in question has no such container and uses Spring instead - I was not keen to introduce a container just to fix this problem.

Fortunately there are a couple of open source JTA providers that are supposed to play nicely with Spring - the most visible being Atomikos TransactionEssentials and Bitronix BTM. These solutions are also compatible with Hibernate and iBatis - another requirement as both of these technologies are used.

Atomikos TransactionEssentials

TransactionEssentials seemed to be the most popular open source JTA provider, so I gave this a go first. Integrating it was relatively straightforward:
  • Create a couple of Atomikos beans and a property file
  • Switch from the HibernateTransactionManager to the JtaTransactionManager
  • Set the data source implementations to the respective XA proxy provided by Atomikos
  • Configure the database driver class to be an XA variant (org.postgresql.xa.PGXADataSource in my case)
  • Provide some additional configuration to the Hibernate session factory.
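The Spring XML itself isn't reproduced here, but, expressed programmatically, the wiring amounts to something like the sketch below. This is purely an illustration of the beans involved - the class name, resource name, pool size and connection properties are invented for the example:

  import java.util.Properties;

  import org.springframework.transaction.jta.JtaTransactionManager;

  import com.atomikos.icatch.jta.UserTransactionImp;
  import com.atomikos.icatch.jta.UserTransactionManager;
  import com.atomikos.jdbc.AtomikosDataSourceBean;

  public class AtomikosWiringSketch {

    /** One XA-capable data source per database, backed by the PostgreSQL XA driver. */
    public AtomikosDataSourceBean createSnapshotDataSource() {
      Properties xaProps = new Properties();
      xaProps.setProperty("serverName", "localhost");
      xaProps.setProperty("databaseName", "snapshot");
      xaProps.setProperty("user", "loader");
      xaProps.setProperty("password", "secret");

      AtomikosDataSourceBean ds = new AtomikosDataSourceBean();
      ds.setUniqueResourceName("snapshot-db");
      ds.setXaDataSourceClassName("org.postgresql.xa.PGXADataSource");
      ds.setXaProperties(xaProps);
      ds.setMaxPoolSize(10); // init/close are left to Spring's init-method/destroy-method
      return ds;
    }

    /** Spring's JtaTransactionManager delegates to the Atomikos JTA implementations. */
    public JtaTransactionManager createTransactionManager() {
      UserTransactionManager tm = new UserTransactionManager();
      tm.setForceShutdown(false);
      UserTransactionImp ut = new UserTransactionImp();
      return new JtaTransactionManager(ut, tm);
    }
  }
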
With only these Spring configuration changes I was able to get XA transactions working. The test cases that had previously failed now passed. However, I ran into some annoying problems with our integration test suite. It looked as though the Atomikos datasources registered in one test class were still present when running other test classes. This caused a frequent ‘Another resource already exists with name’ error from Atomikos. Initially I suspected this could be a context caching issue with the SpringJUnit4ClassRunner. However, even after carefully applying @DirtiesContext directives to the relevant tests and changing the Maven Surefire fork mode to ‘always’, the problem persisted. After spending some time trying to get the tests to cooperate I decided to try another JTA provider, as I suspected that Atomikos was responsible for the persistence of the data source configuration between tests and that ultimately this was the root of my problem. Also, given that the integration of Atomikos had been relatively smooth, I expected that a transition to another JTA provider would be even easier.

Bitronix BTM

Bitronix also seemed fairly popular and looked to be reasonably actively maintained so I chose to try this out next. Integration was trivial this time around:
  • Swap a set of Atomikos Spring beans with a smaller number of Bitronix beans
  • Change XA driver classes from Atomikos versions to the Bitronix equivalents
  • Modify a few driver property names
For those who are interested - I’ve made my Spring/Hibernate/Bitronix configuration available. When I fired up my test suite this time around everything worked perfectly, so the use of Bitronix BTM seemed to give a quick win - however, some problems were waiting for me when deploying and running the full application.

Avoiding JNDI

Hibernate's default JTATransactionFactory implementation uses JNDI to look up the UserTransaction. This might make sense for applications running in a container, but we have no need for JNDI in our Spring-based configuration. Ultimately I can obtain the UserTransaction instance directly from the application context. To wire this into Hibernate I created a simple JTATransactionFactory that bypasses JNDI with a direct reference to the bean. This has to be instantiated and configured in the context, and must be created before the SessionFactory.
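
I won't reproduce the real factory here, but the idea is simple enough to sketch. The holder below is hypothetical (the names are invented), and the override assumes the protected getUserTransaction() extension point found in later Hibernate 3 releases - check which method your Hibernate version uses for the lookup before copying this:

  import javax.transaction.UserTransaction;

  // Hypothetical holder bean: Spring injects the JTA UserTransaction into it,
  // and the transaction factory reads it back statically - no JNDI involved.
  public class UserTransactionHolder {

    private static UserTransaction userTransaction;

    public void setUserTransaction(UserTransaction ut) {
      userTransaction = ut; // wired up in the application context
    }

    public static UserTransaction get() {
      return userTransaction;
    }
  }

  // In its own file: a transaction factory that skips the JNDI lookup entirely.
  public class DirectJtaTransactionFactory
      extends org.hibernate.transaction.JTATransactionFactory {

    @Override
    protected UserTransaction getUserTransaction() {
      return UserTransactionHolder.get();
    }
  }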

Multiple processes using Bitronix on a single machine

Unfortunately, the legacy application in question runs as multiple separate JVM processes on a single machine. The application really should have been implemented in a multi-threaded way but, unfortunately for me - the current maintainer - it wasn’t, and a big refactor is not feasible at this time. This multi-process pattern posed a problem for Bitronix as it has default locations where it stores its transaction logs, and we don’t want more than one transaction manager updating a given log file. Fortunately, Bitronix accepts alternative log file configurations and I was able to set a unique log file name per process:

bitronix.tm.journal.disk.logPart1Filename=/var/run/myapp/${nodeName}-btm1.tlog
bitronix.tm.journal.disk.logPart2Filename=/var/run/myapp/${nodeName}-btm2.tlog

Additional database resources used by XA

The databases in question were both running in PostgreSQL instances. For the vendor XA driver to function the database server's max_prepared_transactions configuration option must be set to a suitable non-zero value (it is 0 by default). This configuration option only takes effect after a restart. In my case, for one of the database servers I’m using a restart is no small matter - so it just wasn’t practical to change this setting in the short-term. However, I still needed to get my fix out to improve the transactional behaviour of the application as soon as possible.
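
For reference, when a restart is an option the change itself is a one-liner in postgresql.conf - the value below is only illustrative and just needs to comfortably exceed the number of concurrent distributed transactions:

  max_prepared_transactions = 64   # 0 by default, which disables prepared (XA) transactions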

I decided to forego a little robustness and use the Last Resource Commit optimization. This allows one non-XA resource to participate in the distributed transaction. The transaction manager always commits the transaction on this datasource last, so that it can still roll back the XA datasources if that commit fails. Implementing the optimization was straightforward - I switched the XA datasource proxy implementation for the database in question to bitronix.tm.resource.jdbc.lrc.LrcXADataSource. With this in place I had everything working again.
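
In Bitronix this amounted to changing the className of the relevant PoolingDataSource. A rough programmatic sketch is below - the unique name, URL and credentials are invented, and our real configuration was expressed as Spring beans:

  import bitronix.tm.resource.jdbc.PoolingDataSource;

  public class LastResourceCommitSketch {

    /**
     * The snapshot database (whose server can't yet be restarted with
     * max_prepared_transactions enabled) becomes the single non-XA participant
     * by wrapping a plain JDBC driver in Bitronix's LrcXADataSource.
     */
    public PoolingDataSource createSnapshotDataSource() {
      PoolingDataSource ds = new PoolingDataSource();
      ds.setUniqueName("snapshot-db");
      ds.setClassName("bitronix.tm.resource.jdbc.lrc.LrcXADataSource");
      ds.setMaxPoolSize(10);
      ds.getDriverProperties().setProperty("driverClassName", "org.postgresql.Driver");
      ds.getDriverProperties().setProperty("url", "jdbc:postgresql://localhost/snapshot");
      ds.getDriverProperties().setProperty("user", "loader");
      ds.getDriverProperties().setProperty("password", "secret");
      ds.init(); // or leave initialisation to Spring's init-method
      return ds;
    }
  }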

Summary

Note that although the integration of JTA/XA can be fairly straightforward, it is essential that you have test cases to verify the expected transactional behaviour of your application. Do not assume that it’ll just work.

Oct 13, 2012

Last.fm open source commons libraries

We recently open-sourced a few of our common libraries at Last.fm. This was primarily because we had a number of larger projects that we wanted to make public, such as Moji - our Java MogileFS client. These projects have dependencies on convenience classes in our common libraries. Rather than duplicate these classes we decided to also make the effort to open source the libraries themselves.

These libraries contain a few classes that I find particularly useful when writing tests:

lastcommons-lang

This library contains a nice system clock abstraction named (unsurprisingly) 'Clock', which was inspired by this StackOverflow answer. A Clock instance gives you access to various forms of the current time. So far millis, nanos, java.util.Date, java.util.Calendar, and org.joda.time.DateTime instances are supported. The class may seem too simple to be of any value - after all it just delegates to some standard Java API methods. However, the real power of the abstraction comes when unit testing any time-dependent code that you may have.
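
To give a flavour of the idea, here is a minimal sketch of such an abstraction - the real class in lastcommons-lang is richer and its method names may well differ:

  // Minimal sketch of a swappable clock - not the actual lastcommons-lang API.
  public abstract class Clock {

    private static final Clock SYSTEM = new Clock() {
      @Override
      public long currentTimeMillis() {
        return System.currentTimeMillis(); // delegate to the real system clock
      }
    };

    /** The clock used in production code. */
    public static Clock systemClock() {
      return SYSTEM;
    }

    /** Tests provide a mock or stub implementation of this instead. */
    public abstract long currentTimeMillis();
  }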

In your unit tests you now have the ability to swap out the actual system clock instance with a mock or a stub and take control of the 'current time' - at least as far as the class under test is concerned. This can be very useful as time dependent tests that use the system clock directly can often suffer from the following problems:

Long running tests
You may have some code that should change some state once a minute has elapsed. If using the system clock directly you might call Thread.sleep() to allow the system clock to advance:

  myClass.initialise();
  Thread.sleep(TimeUnit.MINUTES.toMillis(1));
  myClass.changeState();
  assertThat(myClass.getState(), is(/* changed state */));

This will work, but will also result in a test that takes no less than one minute to execute. With more tests such delays are compounded and we might end up with a test suite that takes considerable time to execute. We want tests to be fast so that they are unnoticeable and easy to run. With a mockable clock we could instead write the following and be free from long delays:

  when(mockClock.currentTimeMillis()).thenReturn(
    0L, TimeUnit.MINUTES.toMillis(1));
  myClass.setClock(mockClock);
  myClass.initialise();
  myClass.changeState();
  assertThat(myClass.getState(), is(/* changed state */));

Delays in tests are brittle
You might have another test on a class that should only change state if more than 5 seconds but less than 10 seconds have passed. The problem here is that Thread.sleep() does not guarantee the exact duration of the delay. If your machine is busy, the system clock might easily tick past 5 seconds before the thread on which your test is running is rescheduled, and your assertion will fail. I have seen this frequently with tests running on continuous-integration machines - tests may run fine for months, but if the CI machine is particularly busy, brittle time-dependent tests will suddenly start failing. As shown in the previous example, a clock abstraction can remove the need for such delays.

The future
You may wish to test the class for future moments in time and, unless you have a time machine, this just isn't practical with the system clock alone. However, you can set a mock or stub clock to whatever time you wish.
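
Following the style of the earlier example (again, the real Clock method names may differ), that might look something like:

  // Pretend that 'now' is one day in the future.
  long tomorrow = System.currentTimeMillis() + TimeUnit.DAYS.toMillis(1);
  when(mockClock.currentTimeMillis()).thenReturn(tomorrow);
  myClass.setClock(mockClock);
  // ... exercise myClass and assert on its behaviour at the future time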

lastcommons-test

We really like using JUnit's TemporaryFolder rule for cleanly managing temporary folders and files in our tests. It automatically handles the tear-down of the temporary files and removes the need for unsightly path strings in the test code, which in the worst case often end up in a shared constant somewhere. We applied a similar approach to the handling of data files for tests using a DataFolder abstraction. Assuming our standard Maven project layout, implementations allow simple, clean and scoped access to test data. The data folders also allow us to organise test data by test class or even test method without having to worry about the underlying file path. Some examples:

Top level data folder

  public class MyTest {

    @Rule
    public DataFolder dataFolder = new RootDataFolder();

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data
      ...

Top level data folder with children

  public class MyMp3Test {

    @Rule
    public DataFolder dataFolder
      = new RootDataFolder("mp3", "128K", "clean");

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data/mp3/128K/clean
      ...

Per-class data folder

  public class MyTest {

    @Rule
    public DataFolder dataFolder = new ClassDataFolder();

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data/fm/last/project/MyTest
      ...

Per-method data folder

  public class MyTest {

    @Rule
    public DataFolder dataFolder = new MethodDataFolder();

    @Test
    public void myTestMethod() throws IOException {
      File actualFolder = dataFolder.getFolder();
      // Path: ./src/test/data/fm/last/project/MyTest/myTestMethod

Wrap up

Hopefully you'll find these libraries, or the concepts presented here, useful in your own tests.

Oct 12, 2012

Garbage Collector tuning for services with large heaps

I've been working on a Thrift-based Java service that uses a large heap. Most of the heap is occupied by a fairly static 7GB dataset that is reloaded every 24 hours. Real-time updates to the data occur throughout the day, but they are low in both volume and size, and are incorporated into the next day's 'full' dataset. The rest of the heap is made up of very short-lived objects necessary to represent and process client requests to the service. Ideally the service should have a response latency in the order of milliseconds, and it met this requirement easily on its initial deployment using just the Concurrent-Mark-Sweep garbage collector.

More recently the traffic has increased by an order of magnitude. In these conditions the default garbage collector configuration couldn't cope and 'stop the world' pauses lasting many seconds were killing performance. We were analysing the garbage collector operation with the following JVM options:

  -verbose:gc
  -Xloggc:/var/log/service-name/gc.log
  -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps

We found that the garbage collector was collecting across the large old generation. However, in the expected usage scenario the old generation should ideally contain only the long-lived static dataset, perhaps along with the real-time data updates that are delivered throughout the day - nothing should need to be collected from it. The GC logs showed that many objects were frequently tenured into the old generation only to be garbage collected later. So where were all of these objects coming from? The most likely source was the groups of 'short-lived' objects created per request - if these were being promoted into the old generation then it could grow quickly, requiring frequent long collections. So how can you prevent this from happening?
  1. Increase the maximum tenuring age to increase the window of opportunity for young objects to be collected before they reach the old generation.
  2. Increase the size of the new generation to lengthen the interval between new space collection cycles which in turn can make it less likely that young objects will be promoted to the old generation.

Increasing the maximum tenure threshold

On each young generation GC cycle, the garbage collector promotes surviving objects from the eden space into a survivor space. There are two survivor spaces and CMS alternates between them on each cycle. Each object in the survivor space carries a tenure counter that records the number of GC cycles the object has survived there. Every time the survivor space is collected, all objects that remain in the space have their counter incremented. Once an object's counter reaches the maximum tenure age, it is promoted to the old generation.

By keeping objects in the smaller survivor spaces for longer, we make them stick around for a significant amount of time before they are promoted to the old generation - long enough, hopefully, for any truly short-lived objects to be collected well before they hit the maximum tenure count. The default maximum tenure threshold is 4, but it can be configured as high as 15 like so:

  -XX:MaxTenuringThreshold=15

Increasing the size of the new generation

Ideally we want all short-lived objects to be collected early on, from the new generation. However, if the new space is small then there will be many frequent small collections. When object throughput is high, the eden space fills up quickly - and if the requests that created those objects are still in progress, many objects will enter the survivor spaces because they are still 'in use'. As the collections are frequent, the tenure counters on the objects in the survivor spaces will increment quickly. Should these objects still be 'in use' by the time they reach the maximum tenure age then they'll enter the old generation - something that we want to avoid. To slow this process down we can increase the size of the new space like so:

  -Xmn3g

When increasing the new space we were actually more concerned with increasing the eden space. If the time it takes to fill eden (and thus cause a collection) is significantly greater than the average request duration, then we can hope that the majority of newly created objects will be 'dead' by the time a collection starts and hence immediately eligible for collection from eden. Thanks to the increase in tenuring threshold, those that do survive eden will then have to make it through 15 more GC cycles in the survivor spaces before they are promoted to the old generation. An arbitrary new generation size of 3GB was chosen as a good starting point.
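
Putting the pieces together, the GC-related launch options ended up looking something like the following - the heap and new generation sizes are illustrative and will need tuning for your own service:

  -Xmx14g
  -Xmn3g
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:MaxTenuringThreshold=15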

Monitoring the garbage collector

With these settings in place we wanted to get a good grasp of how they affected the garbage collection profile of our application. Firstly, we enabled some extra GC logging to provide insight into the tenuring of objects:

  -XX:+PrintTenuringDistribution

We also started an instance of VisualVM on the same machine and installed the VisualGC plugin. This plugin provides excellent insight into what's going on inside the garbage collector in real time and can really aid understanding - especially when compared with the rather obtuse GC log formats.

VisualGC plugin for VisualVM
The VisualGC snapshot above was taken while the service was being load tested. The 7GB dataset is the big green block labelled 'old'. The new collections (the dark brown saw-tooth pattern labelled 'eden') are regular as the load is fairly constant, and thus so is the object throughput. It actually takes quite a while for eden to fill up, so we could likely decrease the new size from 3GB without any ill effect. The fact that the survivor spaces (S0, S1, and Histogram) are so empty shows that nearly all per-request objects are disposed of from eden and never make it to the survivor spaces. The 'histogram' section shows that the objects that do make it into a survivor space fall away rapidly with each cycle, so very little is tenured into the old generation. This is great because the old space is now effectively static and the GC won't have any cause to try to perform a large collection on it. However, we were curious to see which objects were making it into the survivor spaces, so for that we turned to the tenuring distributions in the GC logs (sample edited for brevity):

  2293.760: [GC 2293.760: [ParNew
  Desired survivor size 161054720 bytes, new threshold 15 (max 15)
  - age   1:    8344528 bytes,    8344528 total
  - age   2:    2035920 bytes,   10380448 total
  - age   3:    4351424 bytes,   14731872 total
  - age   4:    4351424 bytes,   14731872 total
  ...
  - age  15:    1165424 bytes,   ...

What this log entry shows is that just over 1MB of data is making it through to the maximum tenuring age. We certainly expected some objects to occasionally reach this point - such as those representing updates to the dataset - however, looking across multiple log entries it was evident that the rate of object promotion to the old space was consistent - perhaps something was leaking? It would be extremely convenient if there were some way of asking the GC 'which objects do you have in tenure bucket 15?' because then we'd know exactly what is being promoted - unfortunately there is no way of doing so.

Luckily we did have a 'good guess'. The service makes use of a monitoring component that uses data buckets to aggregate application metrics over time - each bucket typically holds a minute's worth of data and is valid for up to ten minutes. It seemed likely that these buckets would be alive long enough to be promoted to the old generation and - in the worst case - fill up the old space. We decided to rewrite this component to use fixed-length queues that are allocated once on construction. When testing the service with this change the GC logs changed dramatically:

  5755.168: [GC 5755.168: [ParNew
  Desired survivor size 161054720 bytes, new threshold 15 (max 15)
  - age   1:    9638328 bytes,    9638328 total
  - age   2:      79056 bytes,    9717384 total
  - age   3:         24 bytes,    9717408 total
  - age   4:         56 bytes,    9717464 total
  ...
  - age  14:       1024 bytes,    9724088 total
  - age  15:         32 bytes,    ...

This log shows an extremely low movement of objects from the new space to the old.
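
For illustration, the shape of the rewritten component was roughly as follows - this is a simplified sketch with hypothetical names rather than the real monitoring code:

  // Pre-allocate a fixed ring of per-minute counters once, at construction time,
  // and reuse the slots - no new bucket objects are created per sample, so there
  // is nothing to tenure into the old generation. (Stale-slot handling omitted.)
  public class MetricRing {

    private final long[] counts;

    public MetricRing(int minutes) {
      this.counts = new long[minutes]; // allocated once, lives quietly in old gen
    }

    public void record(long epochMinute, long value) {
      counts[(int) (epochMinute % counts.length)] += value;
    }
  }

The important property is that the component's memory is allocated once up front, so steady-state request traffic and metric recording no longer feed medium-lived objects into the survivor spaces.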

So, we were able to use VisualGC and the GC debug logs to provide insight into the garbage collection process. This helped us tune our application so that long pauses were effectively eradicated and latency targets could be reliably met. Of course, a full GC will happen eventually, but our architecture conveniently allows for this to be on our terms (when running under normal conditions). Different architectures may require alternative garbage collection strategies, but I recommend profiling with VisualGC early on and then using the GC logs to provide greater detail.

I must offer my thanks to my colleague James for pairing up with me on this problem!



Oct 8, 2012

Setting ulimits for jsvc

One of my Thrift-based Java services was recently rolled out to a new set of clients, resulting in a 2000% increase in requests. This was all planned for and the transition was 'fairly' smooth. However, I noticed that the service would periodically run out of file descriptors, resulting in the following exception:

  java.io.IOException: Too many open files

Now, when we planned the deployment of the service we had considered the number of concurrent connections and set the open file limit accordingly for the user under which the service would run, so the presence of this error was quite confusing. I initially suspected a file descriptor leak, so I started graphing the number of files the process had open using a simple script to capture the data - note that as I'm using jsvc, it's fairly easy to identify the process ID:

  #!/bin/bash
  # Record a timestamp, the jsvc PID, and a count of its open file descriptors.
  PID=`cat /var/run/service-name/service-name.pid`
  echo `date` $PID `ls -la /proc/$PID/fd | wc -l`

Sure enough I found a leak, but it was tiny and wouldn't account for the resource usage I was seeing. I identified the source of the leak by correlating it with events on other monitoring graphs, and fixed it anyway. But the graph showed something far more interesting - the file descriptors would consistently max out at ~1000, a value far below our configured limit. I then remembered that jsvc is actually started by the root user and only then drops down to the service user - perhaps it was using the open file limit for root? It looked suspiciously so:

  root@server:/# ulimit -n
  1024

This was exactly the same behaviour I had seen when configuring large pages - which I had unfortunately forgotten! Resource limits for jsvc-based processes must be set for the user with which jsvc is started - not the user it downgrades to. This is a known issue that has been raised with Apache. As a quick workaround I set the limit in my init.d script (shown below), but this feels a bit hidden and I worry that this approach would not work well when running multiple services with their own particular resource limits.

  ulimit -n 15000
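
An alternative (sketched below, not something I have rolled out) would be to raise the open-file limit for the launching user - root, in the case of jsvc - in /etc/security/limits.conf, in the same way the memlock limits had to be granted to root for large pages:

  root soft nofile 15000
  root hard nofile 15000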

Fortunately I don't have to worry about that right now, but it would be more convenient to assign a user per service and then have user-based limits. Hopefully I'll now remember this jsvc behaviour so I don't get caught out by it a third time.

Oct 6, 2012

Investigating unexpected GC pauses


I've recently had a lot of fun learning about GC profiling. I have a service that has a fairly large heap (14GB). The majority of the heap comprises an effectively static dataset that is served to clients. The application had been performing very well using the Concurrent-Mark-Sweep collector. However, I was observing something quite strange on our dashboards - all client requests would time out over a very short period every hour - and this occurred on every single node in the cluster. After much investigation of cron jobs, networking, etc. I decided to see what the garbage collector was doing. I had initially discounted the GC because I reasoned that CMS could not be responsible for such periodic behaviour, given that pressure on memory was low and the load on the service varied throughout the day. I enabled GC logging like so:

    -verbose:gc
    -Xloggc:/var/log/service-name/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps

Unexpectedly, when the next 'timeout' period occurred I saw:

    2012-10-05T17:20:06.498+0000: [Full GC (System) ... ]

So, the system was doing a full 'stop the world' garbage collection cycle. Now, although CMS can and will run a full GC occasionally, the periodic nature of the collections didn't quite fit - I suspected something was calling System.gc() every hour. To investigate this I first hit Ctrl-Alt-H on the method in Eclipse to find any invocations on the classpath - this didn't yield anything obvious. I then hooked the service up to a debugger, set a breakpoint on System.gc() and waited. I was hoping to find that some thread running in one of my 3rd party libraries was abusing the garbage collector, but when the breakpoint hit I was instead directed to:

   sun.misc.GC$Daemon

This is the default garbage collector daemon thread and so initially looked like a bit of a dead end. I had a look at the code and found that the GC interval is controlled by a field, private static long latencyTarget - suspiciously, this had a runtime value of 3600000ms (1 hour). So I had found the cause of the System.gc() invocations, but I still did not know what was configuring the sun.misc.GC$Daemon to operate every hour. As an initial workaround I used the following JVM option to disable the explicit GC invocations:

   -XX:+DisableExplicitGC

However, this didn't make me that happy, as part of our application benefits from an explicit System.gc() call. Once a day clients are refused service and a new 7GB dataset is loaded into memory. The process of loading the data can create a lot of garbage, so it is nice to be able to run a full GC before the service starts accepting clients again. The -XX:+DisableExplicitGC option disables this, so any garbage is going to sit around and will have to be collected when we start serving clients again - not ideal.

Thankfully my colleague James found that some RMI components used by our JMX management layer interact with sun.misc.GC. The sun.rmi.transport.ObjectTable class suspiciously requests a garbage collection latency of 1 hour by invoking the method sun.misc.GC.requestLatency. Further examination of the class (lines 38, 274) showed that this interval is configurable with the following system property:

    sun.rmi.dgc.server.gcInterval

By setting this property to sun.misc.GC.NO_TARGET (== Long.MAX_VALUE == 9223372036854775807L) the periodic call to System.gc() is avoided. With this arcane property value I no longer need -XX:+DisableExplicitGC and am free of the hourly application freeze. So the lesson here is to carefully consider the garbage collection side effects of any JDK or 3rd party libraries that you are using.
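
For reference, on our service this amounted to a launch option along the following lines (the magic number is simply Long.MAX_VALUE):

    -Dsun.rmi.dgc.server.gcInterval=9223372036854775807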

Jun 28, 2012

Configuring the Sun JVM to use huge pages on Debian

I have a Java service that currently runs with a 20GB heap. I was keen to try out the -XX:+UseLargePages option to see how this might affect the performance of the system. I configured the Debian OS as described by Oracle using appropriate shared memory and page values (these can also be calculated with a handy online tool).

However, I had great difficulty starting the JVM with the (sparsely) documented settings. I could see that the OS allocated the expected amount of memory as huge-pages. However, starting the VM with the -XX:+UseLargePages option set always resulted in one of the following errors:

When -Xmx was almost equal to the huge page allocation:
  • Failed to reserve shared memory (errno = 28). // 'No space left on device'

When -Xmx was less than the huge page allocation:
  • Failed to reserve shared memory (errno = 12). // 'Out of memory'

A helpful stackoverflow answer suggested that I apply another kernel parameter that is seemingly not documented in any of the standard -XX:+UseLargePages resources:
  • /proc/sys/kernel/shmall
This parameter must be set to the same size as /proc/sys/kernel/shmmax but expressed in 4KB page units, so for a shmmax of 25769803776 bytes, shmall should be 6291456. This solved some of my issues and I could test the correct instantiation of the JVM with:
  • java -XX:+UseLargePages -Xmx20g -version
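To make these kernel parameters survive a reboot they can also be declared in /etc/sysctl.conf - a sketch using the example sizes above:
  • kernel.shmmax = 25769803776
  • kernel.shmall = 6291456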
However, my application is executed by jsvc from Apache commons-daemon, and when starting it this way the memory allocation problem remained. I hadn't expected this, as the executing user had the appropriate hugetlb group membership and adequate memlock limits declared in /etc/security/limits.conf:
  • appuser soft memlock 25165824
  • appuser hard memlock 25165824
Now, although jsvc is normally started by the root user, it switches to a configured exec user. Some experimentation showed that appuser's memlock limits were not applied when jsvc switched from root to appuser, and hence memory could not be allocated to the JVM. To get around this I also added root to the hugetlb group and provided some additional memlock limits for the root user:
  • root soft memlock 25165824
  • root hard memlock 25165824
Now everything works as expected.



Aug 22, 2011

Update: Lion sleep woes solved

In my previous post I was complaining that my MacBook would no longer sleep. The sleep function worked well after installing Lion but then stopped working unexpectedly.
It turns out that this is not an uncommon problem, with many people complaining of the same issue. Advice on how to fix it varies wildly, with solutions including (but not limited to): SMC reset, PRAM reset, turn off preference X, delete folder Y - there are even 3rd party applications to force your Mac to sleep.

However, the advice that worked for me was actually fairly logical:

1. Use pmset to view the power management settings:
localhost:work teabot$ pmset -g
    ...
    disksleep	10
    sleep		0 (imposed by 46)
    hibernatemode	3
    ...

The interesting thing in the output is the line that states 'sleep 0' - presumably indicating that sleep mode is disabled. Further along this line we find 'imposed by 46'. This is actually telling us the id of the process (PID) that is blocking sleep mode.

2. List the process with the matching PID:
localhost:work teabot$ ps -ef | grep 46
    0  46  1 0 9:05pm ??  0:00.16 /usr/sbin/cupsd -l
    0 107 46 0 9:05pm ??  0:00.59 ipp://nano.local.:631/printers/HP_Photosmart_C4100_series 7...
    ...

So we can see that a queued print job is inhibiting sleep mode. In this case I resolved the issue by clearing my print queue, at which point my MacBook was happy to sleep once more.

It's pretty bizarre that a queued print job should disable sleep mode. The action of closing the MacBook lid should make my intention clear: I want the computer to sleep. If the print job is so important, perhaps a warning dialog should be issued so that I have some idea why sleep mode will not be activated?

I assume that other types of process may inhibit sleep mode. Thankfully the pmset command includes some unexpectedly useful output to help us figure out which process is responsible. I can't imagine how long it would've taken me to find the errant print job without it.

More OSX Lion woes

I'm having more small yet significant problems with OSX Lion. I'm running the latest version - 10.7.1 - on both my MacPro and MacBook.

My MacPro:
I find that I frequently lose the ability to resolve DNS names. This can only be fixed by restarting the machine. Some Googling suggests that this issue may be due to a failing mDNSResponder process. I use this machine as a media server, so in practice I get up in the morning and try to tune in to some Internet radio only to find that I need to reboot my media server before I can do so. I'm going to try killing the process so that I don't have to restart the machine each time.

My MacBook:
Sometimes my MacBook just won't sleep. Closing the lid, Cmd-Alt-Eject, and the Apple menu option have no effect. It's really frustrating. Some people report that this can happen if Internet Sharing is enabled - it's not in my case. Apple's 'sleep' implementation used to be its crowning jewel. Coupled with OSX's stability I could: go months without rebooting my laptop, have a working system within seconds of opening the lid, happily sling my MacBook into a bag just moments after shutting the lid. I can't anymore...

When I wake this machine from sleep - when I can actually get it to sleep, that is - I often find that the wireless card is in an unknown state. This makes it impossible to connect to a wireless network or to switch the wireless off. The machine seems to sort itself out after a few minutes but it's quite annoying - especially if I'm trying to look something up in a hurry.

Summary: With this extremely poor OSX Lion release Apple has reminded us what it was like to run Windows XP on a cheap laptop in 2002.

Aug 10, 2011

Inheriting Maven profiles

I'm in the process of moving a number of projects from Ant/Ivy to Maven. Many of our projects use Thrift and we use a Maven Thrift plugin to conveniently compile our Thrift IDL. Given that all of our Thrift projects have the same setup, the plugin is declared in our parent POM.
The plugin is declared in a profile that is only activated when a thrift source directory is present:
  <profile>
    <id>uses-thrift</id>
    <activation>
      <file>
        <exists>${basedir}/src/main/thrift</exists>
      </file>
    </activation>
    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.thrift.tools</groupId>
          <artifactId>maven-thrift-plugin</artifactId>
          ...

I actually have a number of plugins defined in such profiles so that they become active only when relevant source artefacts are present in the sub-project.

This all looks good, but in practice I found that the profiles were never activated in sub-modules that were being built from a parent project. It was as if the activation state was determined at the parent level, not on a per module basis.

It seems that profile inheritance does not function as I expected in Maven 2. It does however work perfectly with Maven 3.

Jul 22, 2011

AFP failing between OSX Lion machines

I recently upgraded both my MacPro and MacBook to Lion. The upgrades went fine, but I now find that I cannot open a non-guest AFP connection between the two machines. On the client I see 'Connection Failed...' and in the server console I find:

22/07/2011 08:39:20.350 AppleFileServer: _Assert: /SourceCache/afpserver/afpserver-581/afpserver/AFPRequest.cpp, 2005
22/07/2011 08:39:20.354 AppleFileServer: _Assert: /SourceCache/afpserver/afpserver-581/afpserver/AFPRequest.cpp, 2005
22/07/2011 08:39:20.400 AppleFileServer: Calling ReadLock on lock < 0x2a833128 > while holding WriteLock
22/07/2011 08:39:20.524 ReportCrash: DebugSymbols was unable to start a spotlight query: spotlight is not responding or disabled.
22/07/2011 08:39:20.965 com.apple.launchd: (com.apple.AppleFileServer[424]) Job appears to have crashed: Illegal instruction: 4
22/07/2011 08:39:21.062 ReportCrash: Saved crash report for AppleFileServer[424] version ??? (???) to /Library/Logs/DiagnosticReports/AppleFileServer_2011-07-22-083921_localhost.crash
22/07/2011 08:39:22.389 com.apple.launchd: (com.apple.AppleFileServer) Throttling respawn: Will start in 8 seconds

I also have a crash report.

This is fairly annoying as my MacPro acts as the Time Machine backup for my MacBook. Has anyone experienced anything similar and are there any workarounds?

Update: I found a workaround: I removed all folders from 'System Preferences/Sharing/File Sharing/Shared folders' and I can now connect. The unfortunate side effect of this workaround is that the whole file system on the server is now shared - subject to the authenticated user's privileges.