java, performance

Understanding Java List performance


I am working on a project that generates tests for a given data structure in Java. At the moment it chooses 5 random functions to run on 2 different list implementations and times them.

I noticed that the time it takes to run the tests has a strange pattern: it increases exponentially for several seconds (10 or 20) before suddenly dropping back down to where it started. I checked how much memory was being used, and it looks like the memory usage increases and then drops down again at the same time the timing goes back to normal. With further testing I have found that this only happens when the tests include 'add' and 'remove' with an index somewhere in the middle on a LinkedList; with an ArrayList I can't find a pattern.

I expect these functions to take more time, but I didn't expect them to keep getting slower when the size of the list stays roughly the same. Is this possibly something to do with the garbage collector, or am I doing something odd? Sorry if I've explained this badly; let me know if I need to clarify anything.

import java.util.LinkedList;
import java.util.Random;

public class Tester {
    private LinkedList<Integer> list = new LinkedList<>();
    private int aim_len = 100000;
    private int reaction = 500;

    // Bulk-add or bulk-remove elements until the list size is back within
    // 'reaction' of the target length 'aim_len'.
    private void react() {
        while(Math.abs(list.size() - aim_len) > reaction) {
            if(list.size() - aim_len > 0)
                for(int i = 0; i < reaction; i++)
                    list.remove(0);
            else
                for(int i = 0; i < reaction; i++)
                    list.add(0);
        }
    }

    public Tester() {
        Random rand = new Random();

        // Loop forever
        while(true) {
            // Keep list roughly the same size.
            react();

            // Run 500 tests.
            long start = System.nanoTime();
            for(int i = 0; i < 500; i++) {
                // Insert the value 0 at a random index, then remove the
                // element at another random index.
                list.add(rand.nextInt(list.size()), 0);
                list.remove(rand.nextInt(list.size()));
            }
            // Print the natural log of the elapsed time in nanoseconds.
            System.out.println(Math.log(System.nanoTime() - start));
        }
    }

    public static void main(String[] args) {
        new Tester();
    }
}

Sorry about this truly ugly code. This seems to have the same effect.

It is also worth noting that the calls to react() do not line up with the changes in performance.

This is a graph showing the time taken for the tests in the first few seconds of the program running.

[image: Graph going up :(]

This is a graph showing the random drop in time taken several seconds into running.

[image: Graph jumping down]


Solution

  • Java doesn't behave like a predictable Turing machine. You're thinking: I have a bunch of code, it doesn't include any randomness or dependencies on things that take inconsistent amounts of time. Therefore, surely each loop must take exactly the same amount of time, and averaging out a few runs will thus tell me precisely how much time this code takes. It is measured in milliseconds or even nanoseconds, and the granularity is 0, i.e. it is exact.

    But that's just not how any of this works. Computers don't work that way. CPUs don't work that way. JVMs definitely do not work that way.

    A few examples:

    Pre-emption

    CPUs have a bunch of cores in them, but OSes do absolutely mind-boggling amounts of work in the background. There's the obvious (some music is playing, and the buffers are nearly empty, so some CPU needs to chug through an audio decompression algorithm and fill that buffer again), but also the less obvious: disk filesystems that want to do some checksum verification, app updaters, and so on.

    In other words, there are far more tasks to do than there are cores available. OSes solve this with pre-emption: they 'forcefully' yoink a process off its core and force that core to work on something else for a while. I'm vastly oversimplifying here, but, in essence, this can happen during your 'test':
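    For example (the numbers here are made up purely for illustration), suppose each loop really takes 5 msec, but the OS pre-empts your thread partway through the third one:

    Loop1: 5 msec
    Loop2: 5 msec
    Loop3: 15 msec   <- another process got this core for a while
    Loop4: 5 msec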

    Loop3 is 'errant'; if you average out this run, you draw the incorrect conclusion that each loop takes 7.5 msec, whereas the right answer is actually 5.

    Memory pages

    In your simplistic model of how computers work, you think '... and there is memory, which the CPU can read and write'.

    This is wrong.

    Computers have memory, yes. And CPUs cannot read from or write to it at all.

    Instead, there's a memory controller and cache banks. The CPU can only read from and write to its cache banks, and can communicate with the memory controller to ask it to copy entire pages' worth of memory from your memory banks into cache pages and back again. This takes a very long time from the perspective of a CPU!

    The way this works is all rather hidden from you, and even from machine-level code: you really do write a MOV EAX, [SI] machine code instruction, your CPU appears to just run it, and it appears to just work. But the CPU is a tiny little computer with its own OS, and that one instruction might take as long as over a thousand (!!!) instructions: the core's internal OS tries to map that memory pointer onto a cache page, realizes it's not there at all, finds the cache page that has been least recently used, sends a ping to the memory controller to flush that page back to the memory banks and then load the page of memory that includes [SI], and puts it in the freed slot. The CPU then goes to sleep and waits for the memory controller to report that it has finished this job, and over a thousand cycles' worth of time passes.

    You can't directly control any of this, and a JVM does not even guarantee that CPUs work that way. (They tend to; it's kinda how 'we' know how to make fast CPUs. But they didn't work like that 20 years ago and they might not work like this 20 years from now.) Getting pre-empted markedly increases the chance of a page fault.

    A single page fault can massively throw off your measurement.
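    To get a feel for how much memory access patterns matter, here is a rough sketch (not a rigorous benchmark; the class name and sizes are arbitrary) that does the same number of additions twice: once walking an array sequentially, and once jumping around in it so that almost every access lands on a different cache line. On most machines the strided walk is several times slower even though the 'work' is identical:

    public class MemoryWalk {
        public static void main(String[] args) {
            int[] data = new int[1 << 24]; // 16M ints, roughly 64 MB

            // Sequential walk: each cache line is fetched once and fully used.
            long start = System.nanoTime();
            long sumSequential = 0;
            for (int i = 0; i < data.length; i++)
                sumSequential += data[i];
            System.out.println("sequential: " + (System.nanoTime() - start) / 1_000_000 + " ms");

            // Strided walk: the same number of additions, but nearly every
            // access misses the cache, so the CPU mostly waits on memory.
            start = System.nanoTime();
            long sumStrided = 0;
            int stride = 4096;
            for (int offset = 0; offset < stride; offset++)
                for (int i = offset; i < data.length; i += stride)
                    sumStrided += data[i];
            System.out.println("strided:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

            // Use the sums so the JIT cannot remove the loops entirely.
            System.out.println("(sums: " + sumSequential + ", " + sumStrided + ")");
        }
    }

    Of course, this little sketch suffers from every other problem described in this answer too, so treat its output as an illustration rather than a measurement.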

    Hotspot

    At first, JVMs are incredibly slow, all things considered. They just run Java code in the dumbest way possible: read the instruction and do it. No optimization or pre-compilation: it doesn't even re-order instructions or apply any other optimization tricks. Even worse, it does a whole bunch of bookkeeping too. For example, for every branching instruction (an if) it keeps 2 counters and counts how often the branch went 'one way' and how often it went 'the other way'. For funsies. This takes even more time, of course.

    It does all that because it actually leads to really fast execution: all this bookkeeping and 'dumb' execution means the JVM has a good idea of what's happening and where most time is spent. The vast, vast majority of apps spend, literally, 95% of all CPU resources on less than 1% of the code. There is no point whatsoever in running any code quickly except that 1%, which is called 'the hot path'. It's literally faster to spend twice as long on code on the cold path (the other 99% of the code!) if that bookkeeping lets you introduce a slight optimization to the hot path that makes it run 10% faster.

    The JVM does this, and it determines the hot path dynamically. That means some code has this peculiar performance curve:
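    The first X loops: 50 msec each (the JVM is interpreting your code and doing all that bookkeeping).
    One loop somewhere after that: 800 msec (hotspot has decided this is the hot path and compiles it).
    Every loop after that: 4 msec each (the compiled code runs).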

    If X is, say, 200, and you average out the first 250 or so loops, you get a completely wrong answer: (200*50 + 800 + 50*4)/250 = 44 msec, whereas the only meaningful answer here is '4 msec'.
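    You can see this effect yourself without any special tooling. The following is a quick sketch (illustrative only; the class name and sizes are arbitrary) that times the exact same batch of work over and over. On most JVMs the first few batches are markedly slower, and the per-batch time drops sharply once the JIT compiler has compiled the loop:

    import java.util.ArrayList;
    import java.util.List;

    public class WarmupDemo {
        public static void main(String[] args) {
            List<Integer> list = new ArrayList<>();
            for (int i = 0; i < 100_000; i++)
                list.add(i);

            for (int batch = 0; batch < 20; batch++) {
                long start = System.nanoTime();
                long sum = 0;
                // The exact same work every time...
                for (int repeat = 0; repeat < 50; repeat++)
                    for (int value : list)
                        sum += value;
                // ...yet the early batches typically take far longer.
                System.out.println("batch " + batch + ": "
                        + (System.nanoTime() - start) / 1_000 + " µs (sum=" + sum + ")");
            }
        }
    }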

    CPUs do their own branch prediction and such on top of all this. Just like every other section here, this is vastly oversimplified and serves only to explain why it is completely impossible to get a CPU to give you meaningfully 'constant performance'.

    Throttling and performance cores

    Modern CPUs can run faster than their power and cooling infrastructure can handle. This isn't a problem because they are designed to 'burst': being CPU-bound usually doesn't last long, so they can emit far more heat than the system can wick away for a short while, on the assumption that very soon afterwards they'll have nothing to do and the cooling infrastructure can 'catch up'. Stuff in the cloud is all virtualized and uses burst credits on vCPUs with the same logic.

    This, naturally, makes any attempt to measure performance very difficult.

    Amortized costs

    The performance characteristics of a data structure aren't necessarily 'constant', and ArrayList is one such structure with non-constant costs.

    Yes, the cost of, say, 'add an element at the end of the list' is best explained as O(1) (the time this takes does not meaningfully change regardless of how large your list has grown; the millionth call is not slower than the 10th). But that's oversimplified.

    ArrayLists do have a capacity. By default, it is 10: They cannot store more than 10 things. They have reserved some memory to hold 10 things (even if there aren't 10 things in there yet, they did reserve the space, so adding an element is near instantaneous; they already know where to put it, the space where the next item to add would go is "free" and the list knows that, so it can just put it there without checking).

    So what happens if you add an eleventh element?

    The ArrayList makes a completely new item storage that is larger than the one it has, copies all of its previous storage over, and then ditches the old storage. It's like someone who has a filing cabinet, and when asked to add an item while it is full, goes out to a store, buys an entirely new, larger filing cabinet, moves all their documents to the new filing cabinet, puts the old filing cabinet out on the street for garbage collection, and only then goes: So, where were we? Oh, right, you want to add this file. Yeah sure, no problem, now I have room for it.

    And it will continue to be able to swiftly respond to any 'add an element at the end' requests until the new filing cabinet is full. At which point it does the whole routine again: Buy an even bigger cabinet, and so forth.

    If you know beforehand how much you'll be adding, simply make your list with new ArrayList<String>(5812); (if you know you will add no more than 5812 items to this thing ever) and you never run into this.

    Naturally, if you are the 'unlucky' call to .add(elem) that happens to force the ArrayList into moving all its stuff into a bigger filing cabinet, that one call takes orders of magnitude longer.

    Such costs are generally ignored because they are 'amortised' - you get to divide their significant cost over all the calls they 'cover', and the sizes of the new filing cabinets being ordered grow exponentially too, so the grand total 'moving costs' never amount to a significant part of the total, and thus O(1) remains valid (O(1) means: the one millionth call is just as fast as the tenth; the size of the structure has no impact on the performance of this operation. You could make an argument for O(log n), but that's not significantly slower than O(1) for most pragmatic purposes).
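    As a rough back-of-the-envelope check (using the roughly 1.5x growth factor that java.util.ArrayList actually uses): to reach n elements, the copies performed across all the resizes add up to about n/1.5 + n/1.5^2 + n/1.5^3 + ..., which is roughly 2n element copies in total. That is only a couple of extra copies per add on average, which is why the amortised cost of add stays constant no matter how big the list gets.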

    But, trying to list out how long each call takes naturally gets thrown way off by such things.
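    If you want to see those spikes directly, here is a rough sketch (not a rigorous benchmark; the class name and the half-millisecond threshold are arbitrary) that times every individual add() on a default-sized ArrayList and only prints the ones that are dramatically slower than their neighbours. The reported spikes line up with the moments the list has to order a bigger filing cabinet and copy everything across (plus, inevitably, the occasional GC pause or pre-emption, which is rather the point of this whole answer):

    import java.util.ArrayList;

    public class ResizeSpikes {
        public static void main(String[] args) {
            // Default capacity, so the list has to keep growing as we add.
            ArrayList<Integer> list = new ArrayList<>();
            for (int i = 0; i < 5_000_000; i++) {
                long start = System.nanoTime();
                list.add(i);
                long nanos = System.nanoTime() - start;
                if (nanos > 500_000) // only report adds slower than half a millisecond
                    System.out.println("add #" + i + " took " + nanos / 1_000_000.0 + " ms");
            }
        }
    }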

    The solution - JMH + awareness

    You need to use the Java Microbenchmark Harness (JMH) to do performance testing. It requires you to jump through quite a few hoops, and there are many caveats about what you're calling and how you write your code. As the above explanations of why computer performance is not constant should attest (and I only covered the tip of the iceberg!), this is difficult, and fundamentally so: without understanding these complications it's simply not possible to do it right.

    JMH takes care of as much of this as it can: it will 'warm up' your loops (run them a bunch of times, disregarding how long that takes, to make sure hotspot has kicked in), disregard errant timings, increase thread priority levels so that the odds of getting pre-empted are lower, and so on.
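    For completeness, here is a minimal sketch of what a JMH benchmark for the question's scenario might look like. It assumes you have set the project up with the JMH dependency and annotation processor (the official Maven archetype does this for you), and the class and method names are just placeholders:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Thread)
    public class ListBenchmark {

        private List<Integer> arrayList;
        private List<Integer> linkedList;
        private Random random;

        @Setup
        public void setUp() {
            arrayList = new ArrayList<>();
            linkedList = new LinkedList<>();
            for (int i = 0; i < 100_000; i++) {
                arrayList.add(i);
                linkedList.add(i);
            }
            random = new Random(42);
        }

        // Insert at a random index, then remove at a random index, so the list
        // stays the same size - roughly what the question's Tester does by hand.
        // Returning the removed value stops the JIT from optimizing the work away.

        @Benchmark
        public int addRemoveMiddleArrayList() {
            arrayList.add(random.nextInt(arrayList.size()), 0);
            return arrayList.remove(random.nextInt(arrayList.size()));
        }

        @Benchmark
        public int addRemoveMiddleLinkedList() {
            linkedList.add(random.nextInt(linkedList.size()), 0);
            return linkedList.remove(random.nextInt(linkedList.size()));
        }
    }

    Build and run it with the JMH runner (with the archetype setup: mvn package, then java -jar target/benchmarks.jar) and it will handle warm-up iterations, forked JVMs and the statistics for you.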