Java Serialization

Java's built-in serialization mechanism has a bad reputation; everybody seems to have a story of serialization gone wrong. But I don't think it deserves that reputation. Yes, it's verbose, but that only matters in high-volume applications. Yes, it's finicky about versioning, but if you take care you can add or remove fields without pain. And yes, there are quirks in ObjectOutputStream that can cause memory leaks and incorrect data, but that's a matter of knowing how the stream works.

Serialization Basics

For objects built from primitives and other serializable objects (which includes most of the “data” objects from the JDK), you enable serialization simply by implementing the Serializable marker interface.

public class BasicSerializableClass
implements Serializable
{
    private static final long serialVersionUID = 1L;

    private int ival;
    private String sval;

    public BasicSerializableClass(int i, String s)
    {
        ival = i;
        sval = s;
    }

    public int getIval()
    {
        return ival;
    }

    public String getSval()
    {
        return sval;
    }
}

The second piece of serialization is the streams: ObjectOutputStream to write your objects, and ObjectInputStream to read them. Object streams are decorators for an underlying input or output stream. In this example I used a file, because I wanted to preserve the serialized data for the next section.

public static void main(String[] argv)
throws Exception
{
    File tmpFile = File.createTempFile("example", ".ser");
    tmpFile.deleteOnExit();

    BasicSerializableClass orig = new BasicSerializableClass(123, "Hello, World");

    ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(tmpFile));
    oos.writeObject(orig);
    oos.close();

    ObjectInputStream ois = new ObjectInputStream(new FileInputStream(tmpFile));
    BasicSerializableClass rslt = (BasicSerializableClass)ois.readObject();
    ois.close();

    System.out.println("result.ival = " + rslt.getIval());
    System.out.println("result.sval = " + rslt.getSval());
}

I think it's instructive to look at what is actually written to the stream. For details, see the protocol spec. However, you can get a sense from the following dump: after a prologue comes the classname of the serialized object, followed by the name, type, and value for each of its fields.

00000000  AC ED 00 05 73 72 00 3A 63 6F 6D 2E 6B 64 67 72        sr :com.kdgr
00000010  65 67 6F 72 79 2E 65 78 61 6D 70 6C 65 2E 73 65    egory.example.se
00000020  72 69 61 6C 69 7A 61 74 69 6F 6E 2E 42 61 73 69    rialization.Basi
00000030  63 53 65 72 69 61 6C 69 7A 61 62 6C 65 43 6C 61    cSerializableCla
00000040  73 73 00 00 00 00 00 00 00 01 02 00 02 49 00 04    ss           I
00000050  69 76 61 6C 4C 00 04 73 76 61 6C 74 00 12 4C 6A    ivalL  svalt  Lj
00000060  61 76 61 2F 6C 61 6E 67 2F 53 74 72 69 6E 67 3B    ava/lang/String;
00000070  78 70 00 00 00 7B 74 00 0C 48 65 6C 6C 6F 2C 20    xp   {t  Hello,
00000080  57 6F 72 6C 64                                     World

As I said at the start of this article, the serialization format is verbose: our sample object contains 16 bytes of actual data (4 for the int, 12 for the UTF-8 encoded string), yet the serialized version takes 133 bytes.
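You can measure this overhead yourself by serializing into a byte array rather than a file. Here's a sketch; SizeCheck and Sample are names I made up for illustration, with Sample standing in for the class above:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SizeCheck
{
    // stand-in for BasicSerializableClass, with the same 16 bytes of field data
    static class Sample implements Serializable
    {
        private static final long serialVersionUID = 1L;
        private int ival = 123;
        private String sval = "Hello, World";
    }

    public static void main(String[] argv) throws Exception
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos))
        {
            oos.writeObject(new Sample());
        }
        // the reported size will dwarf the 16 bytes of actual field data
        System.out.println("serialized size = " + bos.size() + " bytes");
    }
}
```

The exact number depends on the fully-qualified classname, since that's written to the stream as part of the class descriptor.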

Evolving a Serializable Object

Incompatible changes are one of the first things that people stumble over with regards to serialization: they write an object, then make some seemingly minor change to the object's class, and find they can't reload the serialized data. Which is unfortunate, because the serialization protocol is actually quite resilient to change, provided that you follow a few rules.

The first of those rules is that you must always define serialVersionUID. If you don't, then the serialization mechanism creates its own value by hashing class metadata, including not just member variables, but all of the method names and their access modifiers. If the value written to the stream differs from the value of the destination class, you won't be able to deserialize the object.

But if you do pay attention to versioning, you can make an extraordinary number of changes to your serializable classes, and they'll still be readable. For example, here's a new version of the class that started this article.

public class BasicSerializableClass
implements Serializable
{
    private static final long serialVersionUID = 1L;

    private int intval;
    private String sval;
    private BigDecimal newVal;

    public BasicSerializableClass(int i, String s, BigDecimal bd)
    {
        intval = i;
        sval = s;
        newVal = bd;
    }

    public int getIval()
    {
        return intval;
    }

    public String getString()
    {
        return sval;
    }

    public BigDecimal getNewVal()
    {
        return newVal;
    }
}

So that you don't have to page back and forth, here are the changes to this class:

  1. The field ival has been renamed to intval.
  2. A new field, newVal, of type BigDecimal, has been added.
  3. The constructor takes a third argument, used to populate newVal.
  4. The accessor getSval() has been renamed to getString(), and a new accessor getNewVal() has been added.

While these changes seem extensive, they are compatible:

  1. Deserialization matches fields by name. The stream's ival has no counterpart in the new class, so its value is discarded; intval and newVal simply take their default values (0 and null).
  2. Constructors and methods aren't part of the serialized form, so renaming, adding, or removing them has no effect, as long as serialVersionUID stays the same.

So what constitutes an incompatible change? In practice, the only changes that you have to worry about are changes to types, either of the object itself or of any of its fields: for example, changing ival from int to long.

That said, your program might also consider some changes incompatible, even if serialization doesn't: perhaps you were expecting getIval() to return something other than zero, or getNewVal() to return something other than null (a more likely situation). However, if you're the person writing the program, then you have control over how it handles incompatible data.

I'm going to finish this section with a comment on serialVersionUID values: you'll note that I used 1L, to indicate the first version of the class. If I make an incompatible change, I'll increment it to 2L, and increment again for future changes. I believe that this is the easiest way to keep track of class evolution. There are tools that will give you a hashed value, but those values will require you, the programmer, to pore through source control to see the various changes.

On the other hand, if you already wrote serialized data without an explicit value, then you need to use the same value to deserialize the data. Before modifying the class, use a tool such as serialver to generate the hashed value. But for subsequent incompatible changes, I still recommend using simple incremented numbers.
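If you'd rather stay inside Java than shell out to serialver, you can retrieve the same hashed value programmatically via ObjectStreamClass. A sketch; Unversioned is a made-up stand-in for your legacy class:

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

public class SerialVersionCheck
{
    // a class with no explicit serialVersionUID; the JVM hashes its metadata
    static class Unversioned implements Serializable
    {
        private int ival;
        private String sval;
    }

    public static void main(String[] argv)
    {
        long uid = ObjectStreamClass.lookup(Unversioned.class).getSerialVersionUID();
        // copy this into the class as: private static final long serialVersionUID = <uid>L;
        System.out.println("computed serialVersionUID = " + uid + "L");
    }
}
```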

Reading and Writing Non-Serializable Components

As I said at the start of this article, many of the classes in the JDK are serializable. So your objects can reference a BigInteger, or a Calendar, or even a Class, and have no problem. But what if your class holds a JarFile?

public class UnserializableObject
implements Serializable
{
    private static final long serialVersionUID = 1L;

    private String id;
    private JarFile jar;

Yes, this is a contrived example: there aren't a lot of real-world cases where you'd need to do this, but it was hard to find a simple class in the JDK that was not serializable. You're more likely to need custom serialization when using classes from a third-party library. That said, I have an application that uses a HashMap<String,JarFile> as a lookup table for the dependencies in a Maven project; it's expensive to build, and therefore I might want to serialize it as a performance optimization for multiple runs with the same project.

As written, UnserializableObject claims to be serializable. The compiler believes that claim; it can't verify that all potentially referenced objects are in fact serializable. But when you try to write an instance of this object to a stream, you'll get a NotSerializableException.

Object streams allow objects to define their own serialization methods, so if we can find a way to preserve and reconstruct the non-serializable field, we can prevent this exception. In the case of JarFile, this is easy: you can retrieve the name of the file, and construct a new instance with the same name. Of course, if the file doesn't exist when you're ready to deserialize, then you can't read the object back; this might happen if you try to load the serialized representation on a different machine, or as a different user.

To make an unserializable object serializable, add the writeObject() and readObject() methods. Pay careful attention to the method signatures: if you don't follow the exact signature (including making the methods private), the streams won't call them.

private void writeObject(ObjectOutputStream out)
throws IOException
{
    out.writeObject(id);
    if (jar == null)    out.writeObject(null);
    else                out.writeObject(jar.getName());
}

private void readObject(ObjectInputStream in)
throws IOException, ClassNotFoundException
{
    id = (String)in.readObject();
    String jarName = (String)in.readObject();
    if (jarName != null)
    {
        jar = new JarFile(jarName);
    }
}

As I said, JarFile provides a getName() function, which is enough information to reconstruct the object. Note that I had to account for the possibility that jar was null. Also note that, if the file doesn't exist when deserializing, the JarFile constructor throws FileNotFoundException, which is a subclass of IOException and therefore meets the signature requirements. You will need to trap any other checked exceptions and convert them as appropriate.

I'll wrap up this section by noting that there's another way to handle custom serialization: implement the Externalizable interface. This gives you complete control over the process; it could be a way to avoid the overheads introduced by the normal serialization protocol. But, frankly, it's a pain to implement for anything other than simple data holders; you have to handle your entire superclass hierarchy. If overhead is your concern, I suggest switching to an alternative serialization mechanism, such as Avro or Protocol Buffers.

When to use Transient Fields

While readObject() and writeObject() are useful for handling data objects that weren't designed with serialization in mind, there are some classes that make no sense to serialize. A MappedByteBuffer, for example, represents a segment of memory within the current process' address space. That mapping won't exist in another process; you'll need to create it anew, given the raw materials of filename, offset, and length.

You could write custom serialization and deserialization code that automatically creates the buffer on the destination. But unlike JarFile, there's no way to retrieve the necessary information from the buffer itself; you need to explicitly store the name/offset/size as instance variables. Given that, it makes more sense to mark the buffer transient, let the stream handle serialization, and lazily recreate the buffer on use:

public class TransientExample
implements Serializable
{
    private static final long serialVersionUID = 1L;

    private File mappedFile;
    private transient MappedByteBuffer buffer;

    public TransientExample(File file)
    {
        this.mappedFile = file;
    }

    public MappedByteBuffer getBuffer()
    throws IOException
    {
        if (buffer == null)
        {
            // map the entire file read-only; note that opening the file may throw
            try (FileChannel channel = FileChannel.open(mappedFile.toPath(), StandardOpenOption.READ))
            {
                buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            }
        }
        return buffer;
    }
}

The same principle holds for any derived object: if you already store enough information to reconstruct the object as needed, do so rather than writing custom serialization code. It's more maintainable to let the stream ensure that all the basic fields arrive at the destination, and for you to focus on the things the stream can't.

ObjectOutputStream Object Sharing

Once you get past version errors and unserializable objects, the next biggest source of problems with Java serialization is that the object streams retain a reference to every object written. If the same object is written to the stream multiple times, the second and subsequent writes use a unique ID, rather than the actual object data. When reading, the input stream recognizes these IDs and uses the first instance.

In most use cases, this is a great feature. It reduces the amount of data sent over the stream, and it ensures that the “shape” of the data will be preserved: if your application depends on the fact that the same object exists at two places, it won't break because the serialization code reconstituted a second instance. Plus — and to me, more important — it prevents infinite recursion:


public class GraphNode
implements Serializable
{
    private static final long serialVersionUID = 1L;

    private List<GraphNode> incoming = new ArrayList<GraphNode>();
    private List<GraphNode> outgoing = new ArrayList<GraphNode>();

This object represents a node in a directed graph; it can have zero or more incoming connections from other nodes, and zero or more outgoing connections. If one of those connections happened to form a loop, and the output stream didn't keep track of objects already written, it would keep following those links until it ran out of stack. Clearly not optimal.
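You can see the sharing behavior without building a whole graph. In this sketch (names are mine), I put the same object into a list twice; after deserialization, both slots still reference a single instance:

```java
import java.io.*;
import java.util.*;

public class SharingDemo
{
    public static void main(String[] argv) throws Exception
    {
        ArrayList<String> list = new ArrayList<String>();
        String shared = new String("shared");   // force a distinct instance
        list.add(shared);
        list.add(shared);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(list);
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        @SuppressWarnings("unchecked")
        List<String> rslt = (List<String>)ois.readObject();
        ois.close();

        // identity comparison: both slots hold the same reconstituted object
        System.out.println("same instance? " + (rslt.get(0) == rslt.get(1)));
    }
}
```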

On the other hand, if you're using serialization as a simple way to implement a message protocol, this behavior can lead to bugs. The first happens when you have mutable objects.

public class SharedMutableObjectExample
{
    private static class MyMutableObject
    implements Serializable
    {
        private static final long serialVersionUID = 1L;
        private int value;

        public void setValue(int value)
        {
            this.value = value;
        }

        public int getValue()
        {
            return value;
        }
    }


    public static void main(String[] argv)
    throws Exception
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);

        MyMutableObject obj = new MyMutableObject();
        for (int ii = 0 ; ii < 5 ; ii++)
        {
            obj.setValue(ii);
            oos.writeObject(obj);
        }

        oos.close();
        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());
        ObjectInputStream ois = new ObjectInputStream(bis);

        for (int ii = 0 ; ii < 5 ; ii++)
        {
            MyMutableObject ret = (MyMutableObject)ois.readObject();
            System.out.println("read #" + ii + " value = " + ret.getValue());
        }
    }
}

If you run this, you'll see the same value, 0, on every line of the output. This is because the stream saw that you were writing the same object over and over, so it only wrote the object's identifier, not its actual value.

One solution to the problem is to replace the call to writeObject() with a call to writeUnshared(). This method instructs the stream to fully serialize objects, without attempting to replace “known” objects by their references. However, it has one critical limitation: it only writes an unshared copy of the passed object; any referenced objects are written as shared. So, if your base object happens to have a byte[] as one of its instance variables, that array is only serialized once, and you'll see the same data over and over again.

I'm sure that the JDK developers had a reason for this behavior, but to me it's just a bug waiting to happen. Rather than use mutable objects, I much prefer to create a new message for each write.

However, that highlights another issue with object sharing. Since the stream keeps a hard reference to all objects written, the garbage collector never tries to reclaim their memory. And we all know where that leads:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Whether or not you actually see this error depends on how many messages you send over the stream, and how much heap you've allocated. But even if you never see the exception, the leak is still there: jconsole will show an ever-growing heap. And a heap dump should have your message object(s) at the top of the instance count.

Rather than rely on writeUnshared(), I recommend calling reset() after writing each message:

oos.writeObject(myUnsharableObject);
oos.reset();

Calling reset() does two things: it clears the output stream's table of object references, and it writes a single-byte control code onto the stream. On the other end, the input stream sees that control code and clears its own table of saved object references. Yes, this will add a small amount of overhead to each object sent over the stream, but in my opinion that's a small price to pay for bug-free communication.
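Applied to the earlier mutable-object example, a reset() after each write makes every value round-trip correctly. Here's the modified program in full:

```java
import java.io.*;

public class ResetExample
{
    static class MyMutableObject implements Serializable
    {
        private static final long serialVersionUID = 1L;
        private int value;

        public void setValue(int value) { this.value = value; }
        public int getValue()           { return value; }
    }

    public static void main(String[] argv) throws Exception
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);

        MyMutableObject obj = new MyMutableObject();
        for (int ii = 0 ; ii < 5 ; ii++)
        {
            obj.setValue(ii);
            oos.writeObject(obj);
            oos.reset();            // clear the stream's reference table
        }
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        for (int ii = 0 ; ii < 5 ; ii++)
        {
            MyMutableObject ret = (MyMutableObject)ois.readObject();
            System.out.println("read #" + ii + " value = " + ret.getValue());
        }
        ois.close();
    }
}
```

This time each line of output shows a different value, 0 through 4, because every writeObject() call serialized the object's current state.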

Closing Thoughts

I originally titled this section “When Not To Use Serialization,” but decided that would give the wrong impression. The decision of when to use or not use serialization depends on your need for exchanging objects with non-Java systems, as well as the longevity of your objects and expected level of structural change.

It may surprise you that I don't think of performance as one of the criteria for (not) choosing Java serialization. While the stream protocol is verbose when compared to a raw binary protocol like Protocol Buffers, it just doesn't matter for most purposes: network and CPU are cheap. I've used serialization to support message rates on the order of 10,000 msg/sec/node without a problem. Plus, alternative formats require you to describe your data in an external file, violating the DRY (“don't repeat yourself”) principle.

Interoperability is a far more important reason to forego Java serialization. If you have to exchange messages between Java and non-Java applications, Java serialization will just get in your way. Yes, the stream format is published, but you'll have to write or find a tool to parse it. If you have to share data objects between Java and non-Java programs, I think the best approach is to create a project that just consists of those data objects, and use a tool like Protocol Buffers to manage serialization.

Related to compatibility: don't store serialized data as a BLOB in a relational database. The whole reason for using a relational database is the ability to relate data in different tables. If you store opaque data in the database, you lose this ability; the database becomes little more than a filesystem (albeit one with transactions). If you're thinking of doing this, you'll probably find that an alternative storage mechanism (maybe a key-value store) is a better choice.

And if you do decide to store serialized data in a database, even in a key-value store, you might discover (too late) that your classes have evolved in incompatible ways and that data is no longer usable.

In my opinion, longevity is the biggest reason not to use Java serialization. Data does change over time, and not all of those changes will be compatible. If your data model is evolving, you should take the time to evolve your data with the model. Using Java serialization prevents you from doing that: there's no easy way to load an instance of MyClass where a particular field contains an int and write it back out with that field marked as a long (the difficult way involves multiple classloaders and some glue code). If you're thinking of long-term data storage, use a real database.

With all that said, for simple messaging and preservation of data between executions of the same program, I think serialization is hard to beat.

Security

Since this article was first written, Java object serialization has been exploited as an attack vector by hackers. If you'd like details on how this works, read my blog post and the linked slide deck (which describes similar attacks using other languages). Here I will limit myself to a short description of the problem, and some steps that you can take to prevent becoming a victim.

The root causes of the attack are simple:

  1. The program deserializes untrusted data.
  2. Somewhere on the classpath is a class that allows execution of code specified as data.

Of these, you can only reasonably control the first, but it's important to understand the second to avoid uninformed decisions.

To deserialize an object, your program must be able to load that class' bytecode from somewhere on the classpath. This requirement may give you a false sense of security: after all, you're not going to write a class that sends sensitive data to a hacker. However, most applications today don't consist solely of classes that you've written; they include dozens — and, after transitive dependencies, perhaps hundreds — of external libraries. So the real question becomes: do any of those libraries have exploitable classes?

As it turns out, many libraries do. In the case that I examined, Apache Commons Collections provided two classes that could be used for an exploit: LazyMap, which is a Map that uses factories to retrieve objects, and InvokerTransformer, which is a factory that uses reflection to create objects. This allowed arbitrary code to be executed just by calling get() on the map.

It's important to understand that commons-collections is not a bad library because it has these classes. The classes are very useful, as is the library as a whole. Which is why it's present on the classpath of many applications. And there are many other good libraries that have similar exploitable classes, just waiting for a hacker with incentive to find them.

The real problem — and fortunately, the one you can prevent — is deserializing untrusted data. The definition of untrusted data is simple: any data that you didn't create and control throughout its lifetime. Here are some examples:

  1. Anything received over a network connection, even from clients that you wrote.
  2. Anything read from a file, queue, or database that other users or processes can write.
  3. Anything that round-trips through a client, such as a serialized session token stored in a browser cookie.

In my blog post, I also talked about how the vulnerability was exploited using a class that defined a member variable as a Map, rather than using a concrete class. To the extent that you can, I recommend sticking to concrete members in serializable classes. However, doing so does not guarantee that you'll be safe, because you have no control over the variables that class or its dependencies define.

Bottom line: don't deserialize untrusted data.
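If you must deserialize data you don't fully control, and you're running on Java 9 or later (or a patched Java 8), the JDK's serialization filters give you a defensive backstop: you can whitelist exactly the classes you expect in the stream and reject everything else. A sketch (the filter pattern syntax is the JDK's; the demo structure is mine):

```java
import java.io.*;

public class FilterExample
{
    private static byte[] serialize(Object obj) throws IOException
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos))
        {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    private static void read(byte[] data) throws Exception
    {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data)))
        {
            // accept java.util.Date only; reject every other class in the stream
            ois.setObjectInputFilter(ObjectInputFilter.Config.createFilter("java.util.Date;!*"));
            System.out.println("read: " + ois.readObject().getClass().getName());
        }
        catch (InvalidClassException ex)
        {
            System.out.println("rejected: " + ex.getMessage());
        }
    }

    public static void main(String[] argv) throws Exception
    {
        read(serialize(new java.util.Date()));               // passes the filter
        read(serialize(new java.util.ArrayList<String>()));  // rejected by the filter
    }
}
```

A filter doesn't make untrusted data trustworthy, but it shrinks the set of classes a hacker can reach from “everything on the classpath” to the handful you name.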

For More Information

If you want more information about the mechanics of serialization, I recommend reading the Object Serialization Stream Protocol specification. It's a relatively simple protocol, and is useful for answering questions of the form “what should I expect if…”

The examples from this article are available as compilable programs. Note that you might need external libraries (eg, Apache Commons IO).

Copyright © Keith D Gregory, all rights reserved