Writing a Java Micro-benchmark

Originally published: 2008-11-16

Last updated: 2014-07-10

A recent issue from the real world: a home-grown templating engine that used String.replaceAll to generate output. Was it causing excessive memory and CPU load, or was it a harmless example of over-engineering? Testing in situ, in the running application server, would not be easy: how would we differentiate the effect of this code from everything else involved in building a page? The answer: use a micro-benchmark to quantify the difference between the two implementation options.

Caveat: This article was originally written circa 2008, using JDK 1.5 and a single-core IBM T-42 as a test platform. While the general points remain valid, there have been many improvements in both machines and JDK in the intervening years; all numbers should be taken with a large grain of salt.

What is a Micro-Benchmark?

A benchmark is any program that attempts to measure computing performance. Traditional “macro” benchmarks exist as a means to compare different platforms in specific areas: for example, the Whetstone benchmark focused on floating-point computation, so was used (and often abused) to compare platforms for scientific number-crunching applications. By comparison, a micro-benchmark attempts to measure the performance of a small piece of code — often a single function call — on a single platform. Its results are typically used to choose between competing implementations that provide the same or similar functionality.

For example, here's a simple micro-benchmark that measures the time taken by String.replaceAll():

public static void main(String[] argv)
throws Exception
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = "this XXX a test".replaceAll("XXX", " is ");
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("elapsed time = " + elapsed + "ms");
    System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
}

That's all there is: a loop, an operation, and some timing code. Running this on my laptop, 1,000,000 iterations take somewhere around 2500 milliseconds, or about 2.5 microseconds per iteration.

So what does this tell you? Well, one thing that it can tell you is that calling replaceAll() isn't that expensive in the grand scheme. Certainly far less expensive than a database call, or writing data to a socket, or really, anything else that is likely to happen when executing a JSP. However, perhaps we can do better with concatenation — after all, why pass up a chance for optimization?

public static void main(String[] argv)
throws Exception
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = "this " + "is" + " a test";
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("elapsed time = " + elapsed + "ms");
    System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
}

Running this on my laptop, I got 10ms total execution time! But do you see the bug?

Pitfall #1: Make sure you're measuring what you think you are

The bug, of course, is that we're concatenating three string literals, and the compiler does this for us. So what we're really testing is the speed of assignment. Change the code so that it actually performs a concatenation on each pass through the loop, as below, and you get a more reasonable time of 350-375 milliseconds per run.

public static void main(String[] argv)
throws Exception
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s1 = "this ";
        String s2 = "is";
        String s3 = " a test";
        String s = s1 + s2 + s3;
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("elapsed time = " + elapsed + "ms");
    System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
}

At this point, you can look at the numbers and make a snap decision: is an extra two seconds per million iterations worth rewriting your code? Sometimes it is: a program that is CPU-bound and performing an enormous number of string manipulations might indeed benefit. But to back up that decision, you need to run your actual application under a profiler, to see exactly what percentage of time is spent doing replaces.

Pitfall #2: Is this a reasonable test?

One of the chief limitations of a micro-benchmark is that it runs in an environment that almost certainly doesn't match production. My laptop, for example, has a single-core processor and runs the Sun 1.5 JDK; there's no way it should be used to test thread contention.

But even with the replaceAll() test, there is a subtle issue: my laptop should never experience garbage collection delays. The Sun JVM uses a generational garbage collector, and the objects created by this test are small and short-lived, so they never leave Eden. The real-world example, however, was running on a JRockit JVM and processing strings that were 100-200 characters long.

Changing the length of the strings is simple, and the -verbose:gc invocation argument will tell us if we're switching from minor to major collections. Still, there's nothing like running in the actual environment. And dropping the benchmark code into a JSP is pretty easy:

<%@page contentType="text/html"%>
<%@page pageEncoding="UTF-8"%>

<html>
<head>
    <title>Micro-Benchmark</title>
</head>
<body>
    <%
    String testString = "this XXX a test";   // replace with something long (100-200 characters)
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = testString.replaceAll("XXX", " is ");
    }
    long elapsed = System.currentTimeMillis() - start;
    %>
    <p> replaceAll: elapsed time = <%=elapsed%> ms,
                <%=(elapsed * 1000.0) / 1000000%> microseconds per execution </p>
</body>
</html>
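
For a test string in that 100-200 character range, you don't need to type out a literal. Here's one way to build one (a sketch: the padding text and the 150-character target are arbitrary choices of mine):

StringBuilder buf = new StringBuilder("some leading text XXX");   // "XXX" is the token that replaceAll() looks for
while (buf.length() < 150)                                         // arbitrary target length
{
    buf.append(" plus some padding text");
}
String testString = buf.toString();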

Pitfall #3: Elapsed time versus Execution time

Running a micro-benchmark on an application server increases the potential for mis-measurement, because the benchmark code isn't running in isolation. Load elsewhere in the system means that your measured execution time will be greater than the actual CPU time consumed by the test.

In some cases, you may be able to measure actual CPU time: JDK 1.5 introduced the java.lang.management package and ThreadMXBean: this class will give you both total and user-mode CPU time consumed by a thread. However, this feature is not supported by all JVMs, as it relies on the underlying operating system to make the information available.
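
For reference, here's a sketch of how that measurement could be wired into the first benchmark from this article; whether it reports anything useful depends on your JVM and operating system:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public static void main(String[] argv)
throws Exception
{
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    if (!threads.isCurrentThreadCpuTimeSupported())
    {
        System.out.println("thread CPU time not available on this JVM");
        return;
    }

    long cpuStart  = threads.getCurrentThreadCpuTime();    // total CPU time, in nanoseconds
    long userStart = threads.getCurrentThreadUserTime();   // user-mode CPU time, in nanoseconds

    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = "this XXX a test".replaceAll("XXX", " is ");
    }

    long cpuElapsed  = threads.getCurrentThreadCpuTime()  - cpuStart;
    long userElapsed = threads.getCurrentThreadUserTime() - userStart;
    System.out.println("cpu time = "  + (cpuElapsed / 1000000)  + "ms, "
                     + "user time = " + (userElapsed / 1000000) + "ms");
}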

It's also overkill: the goal is to quantify the effect of a particular piece of code, not develop guarantees about its execution time. Instead, use a simple measurement strategy, execute the code multiple times, and keep track of its average execution time. Also, note that I said increases the potential for mis-measurement, rather than introduces: any computer has multiple competing processes (a fact that becomes painfully obvious when your anti-virus kicks in during a gaming session).
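
A sketch of that strategy, reusing the replaceAll() loop from earlier (the pass count of 5 is an arbitrary choice of mine):

// run the benchmark several times and report the average, to smooth out background load
int passes = 5;
long total = 0;
for (int pass = 0 ; pass < passes ; pass++)
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = "this XXX a test".replaceAll("XXX", " is ");
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("pass " + pass + ": elapsed time = " + elapsed + "ms");
    total += elapsed;
}
System.out.println("average: " + (total / passes) + "ms per pass");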

Pitfall #4: Cleverness

Looking at the examples so far, you may be having visions of refactoring: after all, they differ by a single line of code. Perhaps you are thinking of a micro-benchmark framework, using a functor object to encapsulate the code being tested. The risk of acting on these thoughts is that you might end up measuring the boilerplate rather than the code to be benchmarked.

Which is not to say that you should write separate programs for each test. In fact, it may be better to combine your tests in a single program, as it will allow you to “warm up” the JVM (a topic that will be discussed in more detail below). As long as you keep the benchmarked code and main loop together in a function, you're not going to skew the results. Keeping the timing markers close to the loop also helps, but matters less: the overhead of a function call or switch statement will disappear next to a million iterations of your test code. For example:

public class Micro2
{
    private static final int REPS = 1000000;
    private static final String PART1 = "for a real test, make this big";
    private static final String PART2 = "so that we can see GC effects";
    private static final String REGEX = "XXX";
    private static final String REPL = " -- ";
    private static final String TEMPLATE = PART1 + REGEX + PART2;

    private enum CodeType { concat, replace };

    public static void main(String[] argv)
    throws Exception
    {
        for (CodeType run : CodeType.values())
        {
            long start = System.currentTimeMillis();
            switch (run)
            {
                case concat :
                    executeConcat(PART1, PART2, REPL);
                    break;
                case replace :
                    executeReplace(TEMPLATE, REGEX, REPL);
                    break;
                default :
                    // do nothing
            }
            long elapsed = System.currentTimeMillis() - start;

            System.out.println(run + ": elapsed time = " + elapsed + "ms ("
                                   + (elapsed * 1000.0) / REPS + "microseconds per execution)");
        }
    }

    private static void executeReplace(String template, String regex, String repl)
    {
        for (int ii = 0 ; ii < REPS ; ii++)
        {
            String s = template.replaceAll(regex, repl);
        }
    }

    private static void executeConcat(String part1, String part2, String repl)
    {
        for (int ii = 0 ; ii < REPS ; ii++)
        {
            String s = part1 + repl + part2;
        }
    }
}

Pitfall #5: Hotspot

Here's an interesting experiment: take the first benchmark program above, and change the loop count:

Loop Count     Execution Time
1,000          50 ms
10,000         70 ms
100,000        310 ms
1,000,000      2500 ms

Intuitively, you'd expect a linear increase in execution time, but you don't see that (at least until the jump from 100,000 to 1,000,000 loops). There are a number of reasons for this discrepancy.

First is that Java loads classes dynamically: the first iteration pays the penalty for loading the java.util.regex classes.

Second, and more important, is that the JVM starts by interpreting the program's bytecodes, but after a certain point the Hotspot compiler identifies code that's frequently called and compiles it into native machine code. You can watch this in action by specifying -XX:+PrintCompilation on the Java command line:

bin, 504> java -XX:+PrintCompilation com.kdgregory.example.benchmark.Micro2
  1   b   java.lang.String::charAt (33 bytes)
  2   b   java.lang.Math::max (11 bytes)
  3   b   java.lang.String::hashCode (60 bytes)
  4   b   java.lang.String::indexOf (151 bytes)
  5*  b   java.lang.System::arraycopy (0 bytes)
  6   b   java.lang.String::getChars (66 bytes)
  7   b   java.lang.AbstractStringBuilder::append (60 bytes)
  8   b   java.lang.StringBuilder::append (8 bytes)
  9   b   java.lang.String::<init> (80 bytes)
 10   b   java.lang.AbstractStringBuilder::<init> (12 bytes)
 11   b   java.lang.AbstractStringBuilder::expandCapacity (51 bytes)
 12   b   java.lang.StringBuilder::toString (17 bytes)
 13   b   java.lang.String::valueOf (14 bytes)
 14   b   java.lang.String::toString (2 bytes)
 15   b   java.lang.StringBuilder::<init> (18 bytes)
  1%  b   com.kdgregory.example.benchmark.Micro2::executeConcat @ 5 (39 bytes)
concat: elapsed time = 661ms (0.661microseconds per execution)

This process, of identifying hot code and compiling it, will take time. And until all hotspots have been identified and compiled, your timings will misrepresent reality. In most cases, simply using a large number of iterations will bring the timings in line: the time consumed by interpreted code early in the run will be “noise” compared to the time consumed by compiled code. Alternatively, you can “warm up” the JVM by running some number of iterations before your actual test (my reading of the various Sun white-papers indicates that 10,000 iterations is sufficient).
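
A warm-up can be as simple as exercising the code under test, with a smaller repetition count, before starting the clock. Here's a sketch based on the Micro2 example above (the 10,000 figure comes from the white-papers just mentioned):

// warm-up: give Hotspot a chance to compile replaceAll() and friends before we measure
for (int ii = 0 ; ii < 10000 ; ii++)
{
    String s = TEMPLATE.replaceAll(REGEX, REPL);
}

// now take the real measurement
long start = System.currentTimeMillis();
executeReplace(TEMPLATE, REGEX, REPL);
long elapsed = System.currentTimeMillis() - start;
System.out.println("replace: elapsed time = " + elapsed + "ms");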

One other feature of Hotspot — and one that is particularly damaging to micro-benchmarks — is dead code removal. If the JVM determines that a particular piece of code is not actually doing anything useful, it will simply stop executing that code. Using the -server JVM argument increases the aggressiveness of dead-code removal, as shown by the following example (which attempts to benchmark an empty loop):

bin, 527> java -XX:+PrintCompilation com.kdgregory.example.benchmark.Micro4
  1   b   java.lang.String::charAt (33 bytes)
  2   b   java.lang.Math::max (11 bytes)
  3   b   java.lang.String::hashCode (60 bytes)
  1%  b   com.kdgregory.example.benchmark.Micro4::main @ 21 (94 bytes)
pass 0 elapsed time = 70ms
pass 1 elapsed time = 80ms
pass 2 elapsed time = 70ms
pass 3 elapsed time = 70ms
pass 4 elapsed time = 70ms

bin, 528> java -server -XX:+PrintCompilation com.kdgregory.example.benchmark.Micro4
  1%      com.kdgregory.example.benchmark.Micro4::main @ 21 (94 bytes)
pass 0 elapsed time = 10ms
  2%      com.kdgregory.example.benchmark.Micro4::main @ 21 (94 bytes)
pass 1 elapsed time = 20ms
pass 2 elapsed time = 0ms
pass 3 elapsed time = 0ms
pass 4 elapsed time = 0ms
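
The usual defense is to make sure the benchmarked code produces a value that the JVM can't prove is unused, for example by accumulating something from each iteration and printing it at the end. Here's a sketch applied to the first benchmark in this article (the checksum is my own addition):

long checksum = 0;                      // consume each result so the loop body has an observable effect
long start = System.currentTimeMillis();
for (int ii = 0 ; ii < 1000000 ; ii++)
{
    String s = "this XXX a test".replaceAll("XXX", " is ");
    checksum += s.length();
}
long elapsed = System.currentTimeMillis() - start;
System.out.println("elapsed time = " + elapsed + "ms (checksum = " + checksum + ")");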

Conclusions

In a perfect world, micro-benchmarks are the second step of optimization: once you discover where your application is spending its time, you write a micro-benchmark to explore alternatives. However, even without that first step, and despite their many limitations, micro-benchmarks are a useful tool to help quantify code behavior.

Remember, however, that “help” is the operative word; their results must always be evaluated in the context of the environment. Writing a micro-benchmark so that you can say “string concatenation is four times faster than replaceAll()” is meaningless. Not only might the numbers change on a different machine or in a different program, but that speedup will make no difference if your program spends most of its time waiting for a database query.

For More Information

There are a number of examples for this article:

Oracle's white-paper gives a high-level description of the Hotspot JVM.

An article by Brian Goetz that details the ways that the Hotspot JVM can derail a micro-benchmark.

Google maintains Caliper, a tool for writing micro-benchmarks. Its chief benefit is that it will decide how many repetitions are sufficient for an accurate benchmark. But you still have to create benchmarkable code.

Copyright © Keith D Gregory, all rights reserved
