Java Reference Objects

or
How I Learned to Stop Worrying and Love OutOfMemoryError

Introduction

I started programming with Java in 2000, after fifteen years with C and C++. I thought myself fairly competent at C-style memory management, using coding practices such as pointer handoffs, and tools such as Purify. I couldn't remember the last time I had a memory leak. So it was with some measure of disdain that I approached Java's automatic memory management … and quickly fell in love. I hadn't realized just how much mental effort was expended in memory management, until I didn't have to do it any more.

And then I met my first OutOfMemoryError. Just sitting there on the console, with no accompanying stack trace … because stack traces require memory! Debugging that error was tough, because the usual tools just weren't available, not even a malloc logger. And the state of Java debuggers in 2000 was, to say the least, primitive.

I can't remember what caused that first error, and I certainly didn't resolve it using reference objects. They didn't enter my toolbox until about a year later, when I was writing a server-side database cache and tried using soft references to limit the cache size. Turned out they weren't too useful there, either, for reasons that I'll discuss below. But once reference objects were in my toolbox, I found plenty of other uses for them, and gained a better understanding of the JVM as well.

The Java Heap and Object Life Cycle

For a C++ programmer new to Java, the relationship between stack and heap can be hard to grasp. In C++, objects may be created on the heap using the new operator, or on the stack using "automatic" allocation. The following is legal C++: it creates a new Integer object on the stack. A Java compiler, however, will reject it as a syntax error.

Integer foo = Integer(1);

Java, unlike C++, stores all objects on the heap, and requires the new operator to create the object. Local variables are still stored on the stack, but they hold a pointer to the object, not the object itself (and of course, to confuse C++ programmers more, these pointers are called "references"). Consider the following Java method, which allocates an Integer, giving it a value parsed from a String:

public static void foo(String bar)
{
    Integer baz = new Integer(bar);
}

The diagram below shows the relationship between the heap and stack for this method. The stack is divided into "frames," which contain the parameters and local variables for each method in the call tree. Those variables that point to objects (in this case, the parameter bar and the local variable baz) point at objects living in the heap.

[Figure: relationship between stack and heap]

Now look more closely at the first line of foo(), which allocates a new Integer object. Behind the scenes, the JVM first attempts to find enough heap space for this object — approximately 12 bytes. If able to allocate the space, it then calls the specified constructor to initialize the object and stores a pointer to the object in variable baz. If the JVM is unable to allocate the space, it calls the garbage collector in an attempt to make room.

Garbage Collection

While Java gives you a new operator to allocate objects on the heap, it doesn't give you a corresponding delete operator to remove them. When method foo() returns, the variable baz goes out of scope but the object it pointed to still exists on the heap. If this were the end of the story, all programs would quickly run out of memory. Java, however, provides a garbage collector to clean up these objects once they're no longer referenced.

The garbage collector goes to work when the program tries to create a new object and there isn't enough space for it in the heap. The requesting thread is suspended while the collector looks through the heap, trying to find objects that are no longer actively used by the program, and reclaiming their space. If the collector is unable to free up enough space, and the JVM is unable to expand the heap, the new operator fails with an OutOfMemoryError. This is normally followed by your application shutting down.

There are many excellent references as to how the Java garbage collector works, and some of them are listed at the end of this article. While they make great reading, and will teach you how to tune your JVM appropriately for the programs that you're running, for now all you need to know is that Java uses a form of mark-sweep-compact garbage collection, based on strong references.

Mark-Sweep-Compact

The idea behind mark-sweep-compact garbage collection is simple: every object that can't be reached by the program is garbage, and can be collected. This is a three-part process:

  1. The garbage collector starts from "root" references, and walks through the object graph marking all objects that it reaches.
    [Figure: heap with live objects marked]
  2. Then, it goes through all objects on the heap, and discards those that aren't marked.
    [Figure: heap with dead objects removed]
  3. Finally, it compacts the heap, moving objects around to coalesce the free space left behind by the collected garbage.
    [Figure: heap after compaction]

So what are these "roots"? In a simple Java application, they're method parameters and local variables stored on the stack, the operands of the currently executing expression (also stored on the stack), and static class member variables.

In programs that use their own classloaders, such as app-servers, the picture gets muddy: only classes loaded by the system classloader (the loader used by the JVM when it starts) contain root references. Any classloaders that the application creates are themselves subject to collection, once there are no more references to them. This is what allows app-servers to hot-deploy: they create a separate classloader for each deployed application, and let go of the classloader reference when the application is undeployed or redeployed.

It's important to understand root references, because they define what a "strong" reference is: if you can follow a chain of references from a root to a particular object, then that object is "strongly" referenced. It will not be collected.

So, returning to method foo(), the parameter bar and local variable baz are strong references only while the method is executing. Once it finishes, they both go out of scope, and the objects they referenced are eligible for collection. In the real world, foo() would probably return the reference held in baz, meaning that it remains strongly referenced by foo()'s caller.

Now consider the following:

LinkedList foo = new LinkedList();
foo.add(new Integer(123));

Variable foo is a root reference, which points to the LinkedList object. Inside the linked list are zero or more list elements, each of which points to its successor. When we call add(), one of these elements will point to an Integer instance with the value 123. This is a chain of strong references, from a root reference, meaning that the Integer is not eligible for garbage collection. As soon as foo goes out of scope, however, the LinkedList and everything in it are eligible for collection — provided, of course, that there are no other strong references to it or its contents.

You may be wondering what happens if you have a circular reference: object A contains a reference to object B, which contains a reference back to A. The answer is that a mark-sweep collector isn't fooled: if neither A nor B can be reached by a chain of strong references, then they're eligible for collection.
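One way to watch this happen is to hang a WeakReference (covered later in this article) on one half of a cycle, then drop the only root reference to it. The following is a minimal sketch, not a rigorous test: the Node class and the method names are made up for illustration, and because System.gc() is only a request, the probe may report false on some JVMs without that meaning the cycle leaked.

```java
import java.lang.ref.WeakReference;

final class CycleDemo {
    // Hypothetical node type: each node points at one other node.
    static final class Node {
        Node other;
    }

    // Builds the cycle A -> B -> A and returns A.
    static Node buildCycle() {
        Node a = new Node();
        Node b = new Node();
        a.other = b;
        b.other = a;
        return a;
    }

    // Creates a cycle that no root refers to, then asks for a collection.
    // Returns true if node A was reclaimed. System.gc() is a hint only, so
    // a false result is inconclusive rather than proof of a leak.
    static boolean cycleWasCollected() throws InterruptedException {
        WeakReference<Node> probe = new WeakReference<Node>(buildCycle());
        for (int attempt = 0; attempt < 10 && probe.get() != null; attempt++) {
            System.gc();
            Thread.sleep(50);
        }
        return probe.get() == null;
    }
}
```

On the JVMs I'd expect most readers to use, cycleWasCollected() should return true after the first requested collection or two, even though A and B still point at each other.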

Finalizers

C++ allows objects to define a destructor method: when the object goes out of scope or is explicitly deleted, its destructor is called to clean up the resources it used. For most objects, this means explicitly releasing the memory that the object allocated with new or malloc. In Java, the garbage collector handles memory cleanup for you, so there's no need for an explicit destructor to do this.

However, memory isn't the only resource that might need to be cleaned up. Consider FileOutputStream: when you create an instance of this object, it allocates a file handle from the operating system. If you let all references to the stream go out of scope before closing it, what happens to that file handle? The answer is that the stream has a finalizer method: a method that's called by the JVM just before the garbage collector reclaims the object. In the case of FileOutputStream, the finalizer closes the stream, which releases the file handle back to the operating system — and also flushes any buffers, ensuring that all data is properly written to disk.

Any object can have a finalizer; all you have to do is declare the finalize() method:

protected void finalize() throws Throwable
{
    // cleanup your object here
}

While finalizers seem like an easy way to clean up after yourself, they do have some serious limitations. First, you should never rely on them for anything important, since an object's finalizer may never be called — the application might exit before the object is eligible for garbage collection. There are some other, more subtle problems with finalizers, but I'll hold off on these until we get to phantom references.

Object Life Cycle without Reference Objects

Putting it all together, an object's life can be summed up by the simple picture below: it's created, it's used, it becomes eligible for collection, and eventually it's collected. The shaded area represents the time during which the object is "strongly reachable," a term that becomes important by comparison with the reachability provided by reference objects.

[Figure: object life cycle, without reference objects]

Enter Reference Objects

JDK 1.2 introduced the java.lang.ref package, and three new stages in the object life cycle: softly-reachable, weakly-reachable, and phantom-reachable. These states only apply to objects eligible for collection — in other words, those with no strong references — and the object in question must be the referent of a reference object:

softly reachable
The object is the referent of a SoftReference. The garbage collector will attempt to preserve the object as long as possible, but will collect it before throwing an OutOfMemoryError.
weakly reachable
The object is the referent of a WeakReference, and there are no strong or soft references to it. The garbage collector is free to collect the object at any time, with no attempt to preserve it. In practice, the object will be collected during a major collection, but may survive a minor collection.
phantom reachable
The object is the referent of a PhantomReference, and there are no strong, soft, or weak references to it. This reference type differs from the other two in that it isn't meant to be used to access the object, but as a signal that the object has already been finalized, and the garbage collector is ready to reclaim its memory.

As you might guess, adding three new optional states to the object life-cycle diagram makes for a mess. Although the documentation indicates a logical progression from strongly reachable through soft, weak, and phantom, to reclaimed, the actual progression depends on what reference objects your program creates. If you create a WeakReference but don't create a SoftReference, then an object progresses directly from strongly-reachable to weakly-reachable to finalized to collected.

[Figure: object life cycle, with reference objects]

It's also important to remember that not all objects are attached to reference objects — in fact, very few of them should be. A reference object is a layer of indirection: you go through the reference object to reach the referred object, and clearly you don't want that layer of indirection throughout your code. Most programs, in fact, will use reference objects to access a relatively small number of the objects that the program creates.

References and Referents

A reference object is a layer of indirection between your program code and some other object, called a referent. Each reference object is constructed around its referent, and the referent cannot be changed.

[Figure: relationships between application code, soft/weak reference, and referent]

The reference object provides the get() method to retrieve a strong reference to its referent. The garbage collector may reclaim the referent at any point; once this happens, get() returns null. The following code shows this in action:

SoftReference<List<Foo>> ref = new SoftReference<List<Foo>>(new LinkedList<Foo>());

// create some Foos, probably in a loop
List<Foo> list = ref.get();
if (list == null)
    throw new RuntimeException("ran out of memory");
list.add(foo);

There are a few important things to note about this code:

  1. You must always check to see if the referent is null
    The garbage collector can clear the reference at any time, and if you blithely use the reference, sooner or later you'll get a NullPointerException.
  2. You must hold a strong reference to the referent to use it
    Again, the garbage collector can clear the reference at any time, even between two statements in your code. If you simply call get() once to check for null, and then call get() again to use the reference, it might be cleared between those calls.
  3. You must hold a strong reference to the reference object
    If you create a reference object, but allow it to go out of scope, then the reference object itself will be garbage-collected. Seems obvious, but it's easy to forget, particularly when you're using reference queues to track when the reference objects get cleared.
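The first two rules can be folded into a small helper, sketched here under a few assumptions: the class and method names are mine, and throwing IllegalStateException is just one way to react to a cleared reference. The point is that get() is called exactly once, and the result is held in a local variable for as long as it's used.

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.util.List;

final class ReferentAccess {
    // Hypothetical helper: returns a strong reference to the referent, or
    // throws if the collector has already cleared the reference object.
    static <T> T strongOrThrow(Reference<T> ref) {
        T referent = ref.get();     // call get() once, not once to test and again to use
        if (referent == null)
            throw new IllegalStateException("referent has been cleared");
        return referent;            // the caller now holds a strong reference
    }

    // Adds a value through the reference and returns the new list size.
    static int addAndCount(SoftReference<List<String>> ref, String value) {
        List<String> list = strongOrThrow(ref);
        list.add(value);            // safe: 'list' cannot be cleared out from under us
        return list.size();
    }
}
```

The third rule is the caller's job: whoever creates the SoftReference has to keep it in a field or collection for as long as it matters.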

Also remember that soft, weak, and phantom references only come into play when there are no more strong references to the referent. They exist to let you hold onto objects past the point where they'd normally become food for the garbage collector. At first, this may seem like a strange thing — if you no longer have anything that points to the object, why would you care about it ever again?

Soft References

We'll start to answer that question with soft references. If there are no strong references to an object but it is the referent of a SoftReference, then the garbage collector is free to reclaim the object but will try not to. You can tune the garbage collector to be more or less aggressive at reclaiming softly-referenced objects.

The JDK documentation says that this is appropriate for a memory-sensitive cache: each of the cached objects is accessed through a SoftReference, and if the JVM decides that it needs space, then it will clear some or all of the references and reclaim their referents. If it doesn't need space, then the referents remain in the heap and can be accessed by program code. In this scenario, the referents are strongly referenced when they're being actively used, softly referenced otherwise. If a soft reference gets cleared, you'll need to refresh the cache.

To be useful in this role, however, the cached objects need to be pretty large — on the order of several kilobytes each. Useful, perhaps, if you're implementing a fileserver that expects the same files to be retrieved on a regular basis, or have large object graphs that need to be cached. But if your objects are small, then you'll have to clear a lot of them to make a difference, and the reference objects will add overhead to the whole process.

Soft Reference as Circuit Breaker

A better use of soft references is to provide a "circuit breaker" for memory allocation: put a soft reference between your code and the memory it allocates, and you avoid the dreaded OutOfMemoryError. This technique works because memory allocation tends to be localized within the application: you're reading rows from a database, or processing data from a file.

For example, if you write a lot of JDBC code, you might have a method like the following to process query results in a generic way and ensure that the ResultSet is properly closed. It only has one small flaw: what happens if the query returns a million rows?

public static List<List<Object>> processResults(ResultSet rslt)
throws SQLException
{
    try
    {
        List<List<Object>> results = new LinkedList<List<Object>>();
        ResultSetMetaData meta = rslt.getMetaData();
        int colCount = meta.getColumnCount();

        while (rslt.next())
        {
            List<Object> row = new ArrayList<Object>(colCount);
            for (int ii = 1 ; ii <= colCount ; ii++)
                row.add(rslt.getObject(ii));

            results.add(row);
        }

        return results;
    }
    finally
    {
        closeQuietly(rslt);
    }
}

The answer, of course, is an OutOfMemoryError, unless you have a gigantic heap or tiny rows. It's the perfect place for a circuit breaker: if the JVM runs out of memory while processing the query, release all the memory that it's already used, and throw an application-specific exception.

At this point, you may wonder: who cares? The query is going to abort in either case, why not just let the out-of-memory error do the job? The answer is that your application may not be the only thing affected. If you're running on an application server, your memory usage could take down other applications. Even in an unshared environment, a circuit-breaker improves the robustness of your application, because it confines the problem and gives you a chance to recover and continue.

To create the circuit breaker, the first thing you need to do is wrap the results list in a SoftReference (you've seen this code before):

    SoftReference<List<List<Object>>> ref
        = new SoftReference<List<List<Object>>>(new LinkedList<List<Object>>());

And then, as you iterate through the results, create a strong reference to the list only when you need to update it:

while (rslt.next())
{
    rowCount++;
    List<Object> row = new ArrayList<Object>(colCount);
    for (int ii = 1 ; ii <= colCount ; ii++)
        row.add(rslt.getObject(ii));

    List<List<Object>> results = ref.get();
    if (results == null)
        throw new TooManyResultsException(rowCount);
    else
        results.add(row);
    results = null;
}

This works because almost all of the method's memory allocation happens in two places: the call to next(), and the loop that calls getObject(). In the first case, there's a lot that happens when you call next(): the ResultSet typically retrieves a large block of binary data, containing multiple rows. Then, when you call getObject(), it extracts a piece of that data and wraps it in a Java object.

While those expensive operations happen, the only reference to the list is via the SoftReference. If you run out of memory the reference will be cleared, and the list will become garbage. It means that the method throws, but the effect of that throw can be confined. And perhaps the calling code can recreate the query with a retrieval limit.

Once the expensive operations complete, you can hold a strong reference to the list with relative impunity. However, note that it's a LinkedList: I know that linked lists grow in increments of a few dozen bytes, which is unlikely to trigger OutOfMemoryError. By comparison, if an ArrayList needs to increase its capacity, it must create a new array to do so. In a large list, this could mean megabytes.

Also note that I set the results variable to null after adding the new element — this is one of the few cases where doing so is justified. Although the variable goes out of scope at the end of the loop, the garbage collector does not know that (because there's no reason for the JVM to clear the variable's slot in the call stack). So, if I didn't clear the variable, it would be an unintended strong reference during the subsequent pass through the loop.
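Here is how the pieces might fit together in one place. This is a sketch under stated assumptions, not a drop-in replacement for processResults(): an Iterator stands in for the ResultSet so that the example runs without a database, the class name is mine, and TooManyResultsException is defined as an unchecked exception only because the fragments above don't say what kind it is.

```java
import java.lang.ref.SoftReference;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical definition; the fragments above use this name without defining it.
class TooManyResultsException extends RuntimeException {
    TooManyResultsException(int rowCount) {
        super("ran out of memory after " + rowCount + " rows");
    }
}

final class CircuitBreaker {
    // The Iterator is a stand-in for the ResultSet loop: its next() call is
    // where the heavy allocation is assumed to happen.
    static List<List<Object>> collect(Iterator<List<Object>> source) {
        SoftReference<List<List<Object>>> ref =
            new SoftReference<List<List<Object>>>(new LinkedList<List<Object>>());
        int rowCount = 0;

        while (source.hasNext()) {
            rowCount++;
            List<Object> row = source.next();   // only the soft reference holds the list here

            List<List<Object>> results = ref.get();
            if (results == null)
                throw new TooManyResultsException(rowCount);
            results.add(row);
            results = null;                     // don't leave a strong reference in the stack slot
        }

        // One last check: the reference could have been cleared on the final row.
        List<List<Object>> results = ref.get();
        if (results == null)
            throw new TooManyResultsException(rowCount);
        return results;
    }
}
```

A caller that catches TooManyResultsException could then rerun the query with a retrieval limit, as suggested above.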

Soft References Aren't A Silver Bullet

While soft references can prevent many out-of-memory conditions, they can't prevent all of them. The problem is this: in order to actually use a soft reference, you have to create a strong reference to the referent: to add a row to the results, we need to have a reference to the actual list. During the time we hold that strong reference, we are at risk for an out-of-memory condition. In this example we store the pointer to the list in a local variable, but even if we just used the value directly in an expression, it would be a strong reference for the duration of that expression.

The goal with a circuit breaker is to minimize the window during which it's useless: the time that you hold a strong reference to the object, and perhaps more important, the amount of allocation that happens during this time. In our case, we confine the strong reference to adding a row to the results, and we use a LinkedList rather than an ArrayList because the former grows in much smaller increments.

Also note that in the example, we hold the strong reference in a variable that quickly goes out of scope. However, the language spec says nothing about the JVM being required to clear variables that go out of scope, and in fact the Sun JVM does not do so. If we didn't explicitly clear the results variable, it would remain a strong reference throughout the loop, acting like a penny in a fuse box, and preventing the soft reference from doing its job.

There are some cases where you just can't make the window small enough. For example, let's say that you wanted to process a ResultSet into a DOM Document. You would have to dereference the document after every call to getObject(), and you might just find that the memory usage to create a new Element is large enough to push you into an OutOfMemoryError (although there are techniques, such as pre-allocating a sacrificial buffer, which may help).

Finally, think carefully about the strong references that you hold. For example, a DOM is typically processed recursively, and you might think of a recursive solution to adding rows, passing in the parent node. However, method arguments are strong references. And in a DOM, a reference to any node is the start of a chain of references to every other node — so if you pass a node into a method, you have just created a long-lasting strong reference to the entire DOM tree.

Weak References

A weak reference is, as its name suggests, a reference that doesn't even try to put up a fight to prevent its referent from being collected. If there are no strong or soft references to the referent, it's pretty much guaranteed to be collected.

So what's the use? There are two main uses: associating objects that have no inherent relationship, and reducing duplication via a canonicalizing map. The first case is best illustrated with a counter-example: ObjectOutputStream.

The Problem With ObjectOutputStream

When you write objects to an ObjectOutputStream, it maintains a strong reference to the object, associated with a unique ID, and writes that ID to the stream along with the object's data. This has two benefits if you later write the same object to the same stream: you save bandwidth, because the output stream only needs to send the ID, and you preserve object identity on the other end.

Unfortunately, it's also a form of memory leak, since the stream holds onto the source object forever — or at least until you close the stream or call reset() on it. If you're using object streams simply as a means to move objects, and aren't concerned about preserving identity or reducing bandwidth, then you quickly learn to call reset() on a regular basis.

If the ObjectOutputStream instead held the source object via a WeakReference, the problem wouldn't happen: when the object went out of scope in the program code, the collector could reclaim the object. Since there would be no way that it could ever be written to the stream again, there's no reason for the stream to hold onto it. Better, the ObjectOutputStream could notify the ObjectInputStream that the object is no longer valid, eliminating memory leaks on the receiving side.

Unfortunately, although the object stream protocol was updated with the 1.2 JDK, and weak references were added with 1.2, the JDK developers didn't think to combine them.

Using WeakHashMap to Associate Objects

To be honest, I don't believe there are many cases where you should associate two objects that don't have an inherent relationship. Either the objects should have a composition relationship, and be collected together, or they should have an aggregation relationship and be collected separately.

However, this rule breaks down if you have no ability to change the objects to reflect their relationship (for example, if you need to form a composition relationship between a third-party class and an application class). It also breaks down in cases like ObjectOutputStream, where the relationship is ad hoc and the objects have differing lifetimes.

Should you find the need to create such an association, the JDK provides WeakHashMap, which holds its keys via weak references. When the key is no longer referenced anywhere else within the application, the map entry is no longer accessible. In practice, the entry remains in the map until the next time the map is accessed, so you may find your related objects sitting in the heap far longer than they should.

Rather than give an example here, we'll look at WeakHashMap in the context of a canonicalizing map.

Eliminating Duplicate Data with Canonicalizing Maps

In my opinion, a far better use of weak references is for canonicalizing maps. And the best example of how a canonicalizing map works — even though it's written as a native method — is String.intern(). When you intern a string, you get a single, canonical instance of that string back. If you're processing some input source with a lot of duplicated strings, such as an XML or HTML document, interning strings can save an enormous amount of memory.

A simple canonicalizing map works by using the same object as key and value: you probe the map with an arbitrary instance, and if there's already a value in the map, you return it. If there's no value in the map, you store the instance that was passed in (and return it). Of course, this only works for objects that can be used as map keys. Here's how we might implement String.intern():

private Map<String,String> _map = new HashMap<String,String>();

public synchronized String intern(String str)
{
    if (_map.containsKey(str))
        return _map.get(str);
    _map.put(str, str);
    return str;
}

This implementation is fine if you have a small number of strings to intern. However, let's say that you're writing a long-running application that has to process input that contains a wide range of strings that still have a high level of duplication. For example, an HTTP server that canonicalizes the headers in its requests. There are only about a dozen values that you'll see in a "User-Agent" header, yet some of those values occur more frequently than others — the Googlebot only visits once a week.

In this case, you can reduce long-term memory consumption by holding the canonical instance only so long as some code in the program is using it. And this is where weak references come in: by holding the map entries as weak references, they will become eligible for collection after the last strong reference disappears. Once the Googlebot has finished indexing your site its user agent string will be collected.

To improve our canonicalizer, we can replace HashMap with a WeakHashMap:

private Map<String,WeakReference<String>> _map
    = new WeakHashMap<String,WeakReference<String>>();

public synchronized String intern(String str)
{
    WeakReference<String> ref = _map.get(str);
    String s2 = (ref != null) ? ref.get() : null;
    if (s2 != null)
        return s2;

    // as-of 1.5, still possible for a string to reference a much larger
    // shared buffer; creating a new string will trim the buffer
    str = new String(str);
    _map.put(str, new WeakReference<String>(str));
    return str;
}

First thing to notice is that, while the map's key is a String, its value is a WeakReference<String>. This is because WeakHashMap only uses WeakReference for its keys; the Map.Entry holds a strong reference to the value. If we did not wrap the value in its own WeakReference, that strong reference would never allow the string to be collected.

Second, since we're holding the values via a reference object, we have to ensure that we establish a strong reference before returning. We can't simply return ref.get(), because it's possible that the reference will be cleared between the time we verify its contents and the time we return. So we create the strong reference s2, verify it, and then return if it's not null.

Third, note that I've synchronized the intern() method. The most likely use for a canonicalizing map is in a multi-threaded environment such as an app-server, and WeakHashMap isn't synchronized internally. The synchronization in this example is actually rather naive, and the intern() method can become a point of contention. Realistically, you could wrap the map with Collections.synchronizedMap(), understanding that two concurrent calls with the same string may return different instances. However, only one instance will go into the map, and since our goal is to reduce duplication, that should be acceptable. The naive approach is better for a tutorial.
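For comparison, the Collections.synchronizedMap() variant might look like the following. Treat it as a sketch: the class name is made up, and I've made no attempt to measure whether it actually reduces contention. Each individual map operation is synchronized, but the lookup and the insert are not atomic as a pair, which is where the "different instances" caveat comes from.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

final class RelaxedCanonicalizer {
    // Each get() and put() is synchronized, but not the sequence of the two.
    private final Map<String, WeakReference<String>> map =
        Collections.synchronizedMap(new WeakHashMap<String, WeakReference<String>>());

    String intern(String str) {
        WeakReference<String> ref = map.get(str);
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical != null)
            return canonical;

        // Two threads can both get here for equal strings; each then
        // returns its own copy rather than a shared instance.
        String copy = new String(str);
        map.put(copy, new WeakReference<String>(copy));
        return copy;
    }
}
```

Single-threaded callers see the same behavior as the synchronized version: two equal strings interned in sequence come back as one instance, provided something still holds the first result.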

One final thing to know about WeakHashMap is that its documentation is somewhat misleading. Above, I noted that it's not synchronized internally, but the documentation states "a WeakHashMap may behave as though an unknown thread is silently removing entries." While that may be how it appears, in reality there is no other thread; instead, the map cleans up whenever it's accessed. To keep track of which entries are no longer valid, it uses a reference queue.

Reference Queues

While testing a reference for null lets you know whether its referent has been collected, doing so requires that you interrogate the reference. If you have a lot of references, it would be a waste of time to interrogate all of them to discover which have been cleared. The alternative is a reference queue: when you associate a reference with a queue, the reference will be put on the queue after it has been cleared.

You associate a reference object with a queue at the time you create the reference. Thereafter, you can poll the queue to determine when the reference has been cleared, and take appropriate action (WeakHashMap, for example, removes the map entries associated with those references). Depending on your needs, you might want to set up a background thread that services the queue, calling remove() to block until references become available.

Reference queues are most often used with phantom references, described below, but can be used with any reference type. The following code is an example with soft references: it creates a bunch of buffers, accessed via a SoftReference, and after every creation looks to see what references have been cleared. If you run this code, you'll see long runs of create messages, interspersed with an occasional run of clear messages (each run of the garbage collector will clear multiple references).

public static void main(String[] argv) throws Exception
{
    List<SoftReference<byte[]>> refs = new ArrayList<SoftReference<byte[]>>();
    ReferenceQueue<byte[]> queue = new ReferenceQueue<byte[]>();

    for (int ii = 0 ; ii < 10000 ; ii++)
    {
        SoftReference<byte[]> ref
            = new SoftReference<byte[]>(new byte[10000], queue);
        System.err.println(ii + ": created " + ref);
        refs.add(ref);

        Reference<? extends byte[]> r2;
        while ((r2 = queue.poll()) != null)
        {
            System.err.println("cleared " + r2);
        }
    }
}

As always, there are things to note about this code. First, although we're creating SoftReference instances, we get Reference instances back from the queue. This serves to remind you that, once they're enqueued, it no longer matters what type of a reference you're using: the referent has already been cleared.

Second is that we must keep track of the reference objects via strong references. The reference object knows about the queue; the queue doesn't know about the reference until it's enqueued. If we didn't maintain the strong reference to the reference object, it would itself be collected, and we'd never be the wiser. We use a List in this example, but in practice, a Set is a better choice because it's easier to remove those references once they're cleared.
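A Set-based version of that bookkeeping might look like this. It's a sketch with made-up names (BufferTracker, track(), drain()), and it isn't thread-safe. Reference objects don't override equals() or hashCode(), so the HashSet compares them by identity, which is exactly what we want here.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashSet;
import java.util.Set;

final class BufferTracker {
    private final ReferenceQueue<byte[]> queue = new ReferenceQueue<byte[]>();

    // Strong references to the reference objects, so that they survive
    // long enough to be enqueued.
    private final Set<Reference<? extends byte[]>> live =
        new HashSet<Reference<? extends byte[]>>();

    SoftReference<byte[]> track(byte[] buffer) {
        SoftReference<byte[]> ref = new SoftReference<byte[]>(buffer, queue);
        live.add(ref);
        return ref;
    }

    // Removes every enqueued reference from the set; returns how many were removed.
    int drain() {
        int removed = 0;
        Reference<? extends byte[]> ref;
        while ((ref = queue.poll()) != null) {
            if (live.remove(ref))
                removed++;
        }
        return removed;
    }

    int liveCount() {
        return live.size();
    }
}
```

To try it without waiting for the garbage collector, you can call enqueue() on one of the returned references; the next drain() should then report one removal.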

Phantom References

Phantom references differ from soft and weak references in that they're not used to access their referents. Instead, their sole purpose is to tell you when their referent has already been collected. While this seems rather pointless, it actually allows you to perform resource cleanup with more flexibility than you get from finalizers.

The Trouble With Finalizers

Back in the description of object life cycle, I mentioned that finalizers have subtle problems. These problems come about because object finalization happens after the object has been marked as garbage, but before its memory is reclaimed. With a little ingenuity, you can use finalizers to cause out-of-memory conditions even when there are no strongly-referenced objects in your program.

The first of the problems with finalizers is that they may not be invoked. If your program never runs out of available memory, the garbage collector will never identify objects that need to be finalized, and their finalizers will never be called. This is a particular concern when the finalizer exists to clean up JNI-allocated resources: if the Java-side objects are small, you could easily run out of C heap space before any of them get collected. The only way around this problem is to manually clean up your objects.

A second problem with finalize() is that it's allowed to create a strong reference to the object, in effect resurrecting it. If you've ever read Stephen King, you will probably guess that the resurrected object "isn't quite right": in particular, its finalizer will never be executed again. While this seems scary on the surface, most people who write finalizers aren't trying to resurrect their objects. And even if they do, ultimately they can't change the order of the universe, and once the object becomes unreachable again it will be eligible for collection.

The real problem with finalizers is that they introduce a discontinuity between the time that the object is identified for collection and the time that its memory is reclaimed. The JVM is guaranteed to perform a full collection before it throws OutOfMemoryError, but if the only objects eligible for collection happen to have a finalizer, then the collection will have little effect. Throw in the fact that a JVM may only have a single thread responsible for finalization of all objects, and you start to see the problems.

The following program demonstrates this: each object has a finalizer that sleeps for half a second. Not much time at all, unless you've got thousands of objects to clean up. Every object goes out of scope immediately after it's created, yet at some point you'll run out of memory (this will happen faster if you reduce the maximum heap size with -Xmx).

public class SlowFinalizer
{
    public static void main(String[] argv) throws Exception
    {
        while (true)
        {
            Object foo = new SlowFinalizer();
        }
    }

    // some member variables to take up space -- approx 200 bytes
    double a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z;

    // and the finalizer, which does nothing but take time
    protected void finalize() throws Throwable
    {
        try { Thread.sleep(500L); }
        catch (InterruptedException ignored) {}
        super.finalize();
    }
}

The Phantom Knows

Phantom references allow the application to know when an object is no longer used, so that the application can clean up the object's non-memory resources. Unlike finalizers, cleanup is controlled by the application. If the application creates objects using a factory method, that method can be written to block until some number of outstanding objects have been collected. No matter how long it takes to do cleanup, it won't affect any thread other than the one calling the factory.

Phantom references differ from soft and weak references in that your program does not access the actual object through the reference. In fact, if you call get(), it always returns null, even if the referent is still strongly referenced. Instead, you use the phantom reference to hold a second strong reference to resources used by the referent.

[Figure: relationships between application code, phantom reference, and referent]

While this seems strange, the purpose of the phantom reference is simply to let you know when the referent has been reclaimed. Your program still needs to be able to access the resources in order to reclaim them, so the referent can't be the sole path to those resources. The application must rely on a reference queue to report when the referent has been collected.
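As a minimal sketch of that arrangement, the phantom reference below carries its own cleanup action, and a small registry holds the references and runs the action once a reference appears on the queue. The names CleanupReference and CleanupRegistry are hypothetical, and representing the resource as a Runnable is an assumption made to keep the example generic.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.HashSet;
import java.util.Set;

// Hypothetical example: the reference, not the referent, is the path to the
// resource, so the resource can still be released after the referent is gone.
class CleanupReference extends PhantomReference<Object>
{
    private final Runnable cleanup;

    CleanupReference(Object referent, ReferenceQueue<Object> queue, Runnable cleanup)
    {
        super(referent, queue);
        this.cleanup = cleanup;
    }

    void cleanup()
    {
        cleanup.run();
    }
}

class CleanupRegistry
{
    private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();

    // strong references to the phantom references, so they are not collected
    private final Set<CleanupReference> pending = new HashSet<CleanupReference>();

    synchronized CleanupReference register(Object owner, Runnable cleanup)
    {
        CleanupReference ref = new CleanupReference(owner, queue, cleanup);
        pending.add(ref);
        return ref;
    }

    // polls the queue without blocking, runs the cleanup for each enqueued
    // reference that is still registered, and returns how many were run
    synchronized int processQueue()
    {
        int count = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null)
        {
            if (pending.remove(ref))
            {
                ((CleanupReference)ref).cleanup();
                count++;
            }
        }
        return count;
    }
}
```

One caveat with this sketch: the cleanup action must not hold a strong reference to the owner, or the owner will never become unreachable. Java 9 and later package up much the same idea as java.lang.ref.Cleaner.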

Implementing a Connection Pool with Phantom References

Database connections are one of the most precious resources in any application: they take time to establish, and database servers place strict limits on the number of simultaneous open connections that they'll accept. For all that, programmers are remarkably careless with them, sometimes opening a new connection for every query and either forgetting to close it or forgetting to close it in a finally block. While the Connection object itself could use a finalizer to release actual resources, that's still dependent on the whim of the garbage collector, and it doesn't limit the number of connections that can be open at any time.

By using phantom references, we gain control over the number of open connections, and can block until one becomes available. In the example, we don't go further than that, but it would be a simple matter to add in a reaper that reclaims connections that have been open/unused too long.

The first part of the connection pool is the PooledConnection object. This is the object that is given to the application to use — the referent of our phantom reference. It implements the JDBC Connection interface, and delegates all operations to the actual connection handed out by the DriverManager.

public class PooledConnection
implements Connection
{
    private ConnectionPool _pool;
    private Connection _cxt;

Since our PooledConnection implements the Connection interface, it must expose all of the methods of that interface. When called, these methods simply delegate to the embedded connection object. I use the internal method getConnection() rather than directly accessing the _cxt field for two reasons: first, so I can apply some checks for the validity of the connection, and second, so that I can override it in a test case.

public void commit() throws SQLException
{
    getConnection().commit();
}

For now, that's enough said about the PooledConnection object. Now let's look at the ConnectionPool, which is a factory for new pooled connections via the getConnection() method. This is a blocking method: if there isn't a connection available in the pool, it will wait until one becomes available (signified by its reference being enqueued).

public synchronized Connection getConnection()
throws SQLException
{
    while (true)
    {
        if (_pool.size() > 0)
            return wrapConnection(_pool.remove());
        else
        {
            try
            {
                Reference<?> ref = _refQueue.remove(100);
                if (ref != null)
                    releaseConnection(ref);
            }
            catch (InterruptedException ignored)
            {
                // this could be used to shut down pool
            }
        }
    }
}

From this method, you should be able to infer that we have a queue of some sort containing our actual connections, and also a reference queue that we'll use to track when our PooledConnection objects get collected. The call to releaseConnection() should give you a hint that we keep track of the pooled connections via their references, so let's take a look at the internal data structures:

private Queue<Connection> _pool
    = new LinkedList<Connection>();

private ReferenceQueue<Object> _refQueue
    = new ReferenceQueue<Object>();

private IdentityHashMap<Object,Connection> _ref2Cxt
    = new IdentityHashMap<Object,Connection>();

private IdentityHashMap<Connection,Object> _cxt2Ref =
    new IdentityHashMap<Connection,Object>();

What's happening here is that the pool maintains two lookup tables: one from the reference object to the actual connection, and one from the actual connection to the reference object. Both tables use IdentityHashMap, because we care about the actual object, and don't want a potential override of equals() to get in our way. Note that the two lookup tables also serve as our strong references to the phantom reference instances, so that they won't get collected.
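To see why the choice of map matters, consider this small sketch. The Key class is a hypothetical stand-in for any object that overrides equals(), such as a driver's Connection implementation might; it is not part of the pool.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

class IdentityMapDemo
{
    // hypothetical key type: all instances with the same name are "equal"
    static final class Key
    {
        final String name;

        Key(String name)
        {
            this.name = name;
        }

        @Override
        public boolean equals(Object obj)
        {
            return (obj instanceof Key) && ((Key)obj).name.equals(name);
        }

        @Override
        public int hashCode()
        {
            return name.hashCode();
        }
    }

    // stores two distinct-but-equal keys and reports how many entries result
    static int countEntries(Map<Key,String> map)
    {
        map.put(new Key("cxt"), "first");
        map.put(new Key("cxt"), "second");
        return map.size();
    }

    public static void main(String[] argv)
    {
        // HashMap compares keys with equals(), so the second put overwrites the first
        System.err.println("HashMap:         " + countEntries(new HashMap<Key,String>()));

        // IdentityHashMap compares keys with ==, so both entries survive
        System.err.println("IdentityHashMap: " + countEntries(new IdentityHashMap<Key,String>()));
    }
}
```

With a regular HashMap, two connections that happened to compare equal would share one bookkeeping entry, and the pool would lose track of one of them.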

Assuming that there are connections in the pool, the wrapConnection() method handles the bookkeeping needed to track that connection. It creates a PooledConnection instance, which is handed to the caller, and a PhantomReference to refer to that instance. It then inserts these objects in the lookup tables.

private synchronized Connection wrapConnection(Connection cxt)
{
    Connection wrapped = new PooledConnection(this, cxt);
    PhantomReference<Connection> ref = new PhantomReference<Connection>(wrapped, _refQueue);
    _cxt2Ref.put(cxt, ref);
    _ref2Cxt.put(ref, cxt);
    return wrapped;
}

Its counterpart is releaseConnection(), which comes in two flavors. The first is meant to be called from within the pool, when the phantom reference is enqueued. It uses the reference to find the actual connection.

synchronized void releaseConnection(Reference<?> ref)
{
    Connection cxt = _ref2Cxt.remove(ref);
    if (cxt != null)
        releaseConnection(cxt);
}

The second version is meant to be called from the PooledConnection itself, when the application explicitly closes that connection (it's also called from the first version). It clears out the bookkeeping objects, and puts the actual connection back into the pool.

synchronized void releaseConnection(Connection cxt)
{
    Object ref = _cxt2Ref.remove(cxt);
    _ref2Cxt.remove(ref);
    _pool.offer(cxt);
    System.err.println("Released connection " + cxt);
}

To go full circle, we'll look at the PooledConnection's close() method, which not only returns the connection to the pool, but also ensures that it won't be used again. Remember: this method will only be called by application code, to explicitly close the connection. If the pool decides to close the connection, the PooledConnection instance will be long gone.

public void close() throws SQLException
{
    if (_cxt != null)
    {
        _pool.releaseConnection(_cxt);
        _cxt = null;
    }
}
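The validity check in the internal getConnection() method isn't shown in this article. A minimal sketch of what it might look like follows, assuming only that close() nulls out the _cxt field as above. The class name PooledConnectionSketch and the exception message are made up for illustration, and the sketch leaves out the Connection interface and the call back to the pool to stay short.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical fragment: shows only the delegation and validity check
class PooledConnectionSketch
{
    private Connection _cxt;

    PooledConnectionSketch(Connection cxt)
    {
        _cxt = cxt;
    }

    // every delegating method goes through here, so a wrapper that has
    // been closed fails immediately instead of touching a pooled connection
    Connection getConnection() throws SQLException
    {
        if (_cxt == null)
            throw new SQLException("connection has been returned to the pool");
        return _cxt;
    }

    public void commit() throws SQLException
    {
        getConnection().commit();
    }

    // simplified: the close() shown above also calls _pool.releaseConnection()
    public void close()
    {
        _cxt = null;
    }
}
```

The point of funneling everything through one method is that application code holding a stale PooledConnection gets an exception, and never reaches a connection that the pool may already have handed to someone else.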

The Trouble with Phantom References

Several pages back, I noted that finalizers are not guaranteed to be called. Neither are phantom references. If the collector doesn't run, it will never collect unreachable objects, and any phantom references won't be enqueued. Consider what would happen if your program used the connection pool above, and threw an uncaught exception immediately after calling getConnection().

The answer is that it would quickly exhaust the pool, and all further requests would block. If your program didn't do anything else that would cause a garbage collection, pretty soon every thread would be blocked, waiting for connections that will never return to the pool.

However, even in this situation, phantom references have an advantage over finalizers: cleanup is under your control. True, with finalizers you could call System.gc() in the hope that it will cause the collector to get to work, but there's no guarantee: per the documentation, it "suggests that the Java Virtual Machine expend effort toward recycling unused objects" (emphasis added).

By comparison, the connection pool could run through its list of outstanding connections, and force them to close, without relying on the finalizer (to be fair, you could make the same thing happen with a finalizer, but at that point you're already more than halfway to an implementation using references).

A Final Thought: Sometimes You Just Need More Memory

While reference objects are a tremendously useful tool to manage your memory consumption, sometimes they're not sufficient and sometimes they're overkill. For example, let's say that you're building a large object graph, containing data that you read from the database. While you could use soft references as a circuit breaker for the read, and weak references to canonicalize that data, ultimately your program requires a certain amount of memory to run. If you can't actually accomplish any work, it doesn't matter how robust your program is.

Your first response to OutOfMemoryError should be to figure out why it's happening. And to do that, simply increase your heap size with the -Xmx parameter. Personally, I find it strange that the JVM expects you to set explicit limits on memory consumption: the rest of the world seems to have no problems with relying on virtual memory management to do the right thing. For whatever reasons, the JVM is different, and by default it doesn't give you a very large allotment: 64 MB in the Sun 1.5 JVM.

During development, you should specify a large heap size (1 GB or more if you have the physical memory) and pay careful attention to how much memory is actually used; most IDEs provide a way to monitor heap usage, or you can turn to classes in the java.lang.management package or even Runtime.totalMemory(). Most applications will reach a steady state under simulated load, and that should guide your production heap settings. If your memory usage climbs over time, it's quite probable that you are holding strong references to objects after they're no longer in use. Reference objects may help here, but it's more likely that you've got a bug that should be fixed.
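If you want to watch heap usage from inside the program, a minimal sketch using the java.lang.management classes might look like the following. The class name HeapReport and the output format are my own choices for illustration; in a real application you would probably log these numbers periodically instead of printing them once.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

class HeapReport
{
    // formats the current heap statistics as a single line
    static String describeHeap()
    {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long mb = 1024L * 1024L;

        // getMax() returns -1 when the maximum is undefined
        String max = (heap.getMax() < 0)
                   ? "undefined"
                   : (heap.getMax() / mb) + "MB";

        return "heap used=" + (heap.getUsed() / mb) + "MB"
             + " committed=" + (heap.getCommitted() / mb) + "MB"
             + " max=" + max;
    }

    public static void main(String[] argv)
    {
        System.err.println(describeHeap());

        // the older, coarser view of the same information
        Runtime rt = Runtime.getRuntime();
        System.err.println("runtime total=" + rt.totalMemory()
                         + " free=" + rt.freeMemory()
                         + " max=" + rt.maxMemory());
    }
}
```

The "used" figure sampled under steady load is the one to compare against your -Xmx setting. The "committed" figure only tells you how much the JVM has claimed from the operating system so far.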

The bottom line is that you need to understand your applications. A canonicalizing map won't help you if you don't have duplication. Soft references won't help if you expect to execute multi-million row queries on a regular basis. But in the situations where they can be used, reference objects are often life savers.

Additional Information

You can download the sample code for this article. This JAR contains both source and executables, with “runner” classes.

The “string canonicalizer” class is available from SourceForge, licensed under Apache 2.0.

Sun has many articles on tuning their JVM's memory management. This article is an excellent introduction, and provides links to additional documentation.

Brian Goetz has a great column on the IBM developerWorks site, "Java Theory and Practice." A few years ago, he wrote columns on using both soft and weak references. These articles go into depth on some of the topics that I simply skimmed over, such as using WeakHashMap to associate objects with different lifetimes.

Copyright © Keith D Gregory, all rights reserved