Tags: java, c++, benchmarking, jit

C++ implementation of a simple map slower than equivalent implementation in Java: Code/Benchmark Issue?


The goal of this research is to explore the performance differences between JIT (just-in-time compilation) and AOT (ahead-of-time compilation) strategies and to understand their respective advantages and disadvantages. The intent is not to claim that one language is slower or worse than the other.
In our tests, we observed better results with the HotSpot JVM 23 using JIT compilation (JVMCI and C2). We got slower results with C++ (compiled with Clang 18), GraalVM 23 (compiled with native-image), and HotSpot JVM 23 with the -Xcomp flag (JVMCI and C2). We are seeking to understand why this happens, and if possible, identify ways to improve the performance of the C++ version to match the results of Java's JIT compilation.

Our benchmark compares a simple and small hash table (map) implementation in Java to an equivalent (line-by-line) implementation in C++. We made every effort to ensure consistency between the two implementations. The goal is not to compare the standard library hash table implementations (java.util.HashMap and std::unordered_map), as those are of course implemented with non-equivalent source code.

The benchmark creates a hash table with 20,000 buckets and inserts 2,000,000 objects. We insert once, clear the map, and then insert again, to take advantage of the hash table's internal pool of Entry objects. In other words, the first time you insert, objects are allocated on the heap. The second time you insert, objects are re-used from the internal object pool.

The Bench class, which handles the measurements, should also be equivalent in both languages.

Given these details, does anyone have insights into why the C++ map implementation was slower than the Java map implementation? Could there be something we overlooked or an aspect of the C++ implementation that could be optimized further? Perhaps specific Clang optimization options we should explore?


(This section added by @petercordes to describe in more detail what the benchmark does. And from discussion in comments with the OP, to describe what I think one of the goals is.)

The C++ version is written such that it could compile to machine code similar to what a JVM might JIT for the Java version. The question is how Java manages to do the same machine operations as the C++ version, but faster.

(Editor's note (@petercordes): I have single-stepped the machine code generated by the Hotspot JVM, and indeed the main loops are linked-list linear searches, using 32-bit compressed references vs. clang++ using native pointers. Interrupting the JVM with GDB, usually RIP points to a place where the code spends most of its time. clang++ -O3 ... -m32 is still slower, so it's not just 32-bit pointers, and the Java objects were still at least 24 bytes, not just 12 for 2 references + an int. Probably something about allocation patterns and cache misses, since Java runs more machine instructions inside the loop and even has to shl by 3 to turn a 32-bit value into a pointer, so best-case load-use latency is worse.)
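To illustrate the extra decode step mentioned above, a compressed reference is turned back into a native pointer roughly as follows. This is a sketch, not JVM source; the function name and parameters are hypothetical, and it assumes the common zero-heap-base, 8-byte-alignment configuration:

```cpp
#include <cstdint>

// Hypothetical sketch of compressed-oop decoding, assuming a zero heap base
// and 8-byte object alignment: the 32-bit reference is scaled by 8 (a shl
// by 3) to recover the 64-bit address. That shift sits on the load-use
// critical path of every pointer chase in the JIT-compiled linked-list loop.
uint64_t decode_oop(uint32_t ref, uint64_t heap_base = 0) {
    return heap_base + (static_cast<uint64_t>(ref) << 3);
}
```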

The C++ version uses pointers extensively to mirror Java's semantics. The "values" stored and retrieved are just pointers (all to the same dummy object). It's probably not useful for most use-cases. For example, put takes an object by const-reference, i.e. a pointer in asm, while get returns it as a non-const pointer. The hash table never dereferences the value pointer or deletes the pointed-to object; in Java the GC takes care of this. In C++ it could hypothetically be used for building a map of objects owned by something else, or in automatic or static storage. It could equivalently take a uintptr_t by value. (A reference is non-nullable but we don't depend on that.)
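The last point above can be made concrete with a round-trip sketch (hypothetical helper names, not from the benchmark): in the ABI, a const reference parameter and a uintptr_t parameter are both one pointer-sized value in a register, which is why the two signatures would compile to equivalent machine code:

```cpp
#include <cstdint>

struct Dummy {};

// Both helpers move one pointer-sized integer in a register: a const
// reference is a non-null pointer in the calling convention, and a
// uintptr_t is the same bits passed by value.
uintptr_t as_bits(const Dummy& ref) {
    return reinterpret_cast<uintptr_t>(&ref);
}

const Dummy* as_ptr(uintptr_t bits) {
    return reinterpret_cast<const Dummy*>(bits);
}
```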

The hash table is an array of linked lists (raw pointers in C++, with nullptr indicating no keys in that bucket). With a load factor of 100 (2,000,000 entries across 20,000 buckets), this is really a linked-list benchmark, not normal operating conditions for a hash table.

This was chosen to make the time for each operation long enough to measure with Linux clock_gettime, as called from Java's System.nanoTime or libstdc++'s chrono::steady_clock::now. Measurement overhead should be fairly equal in both languages, and very high compared to just a hash-bucket lookup that hits in cache, but pretty small compared to traversing a 50-element (average length) linked list.
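A quick way to sanity-check that overhead on a given machine (a sketch, not part of the benchmark; the function name is mine) is to time back-to-back steady_clock reads and take the minimum delta as a rough floor on per-measurement cost:

```cpp
#include <chrono>
#include <climits>

// Returns the smallest observed delta, in nanoseconds, between two
// back-to-back steady_clock::now() calls over `iters` trials -- a rough
// lower bound on the timing overhead added to every measured operation.
long long min_clock_delta_ns(int iters) {
    using clk = std::chrono::steady_clock;
    long long best = LLONG_MAX;
    for (int i = 0; i < iters; i++) {
        auto t0 = clk::now();
        auto t1 = clk::now();
        long long d =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (d < best) best = d;
    }
    return best;
}
```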
The Bench class itself also adds each measured time to a std::map in C++, or in Java to a custom hashmap with a similar design to the hash table being benchmarked, used with a load factor well below 1. This is outside the timed region but does need to do some allocation on the first run, so those allocations interleave with the allocation of nodes for the hash table being benchmarked.

To attempt to minimize performance differences between Java's GC vs. non-GC C++, we use a free-list with unbounded size, only ever deleting in the destructor: if remove() finds a match, it moves the node from the bucket's linked list to the head of the free list. put takes the head of the free list, only calling new if empty. So new gets called only inside the first round of put calls, with the second (after clear) having to advance through the linked-list of nodes. Which should be cheap because there's a whole linked-list search between allocations for out-of-order execution to hide the load latency even on a cache miss.

The initial put stage is not the only one that's slower for the C++ version than the Java version, so it's not just raw performance of new. Perhaps something about locality of allocations?


All source code (which is not much), along with scripts to compile and execute it, together with our results, is in the project's GitHub repository.

From the comments below, it looks like a major source of improvement for the C++ code could be custom allocators for the hash table internal Entry objects. It has been mentioned that Java has an easier/faster time allocating objects in the heap than C++, through the new keyword, but that this shortcoming can be addressed in C++ by using custom allocators. Unfortunately my knowledge of C++ is limited and I will have to do some research to understand how to do that. An answer here with such a change/improvement in the C++ code would be great.
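For reference, a custom allocator along those lines can be quite small. The sketch below is mine, not from the project (the class name and block size are arbitrary): a bump-pointer arena that hands out node-sized slots from large contiguous blocks, so nodes allocated in sequence sit next to each other in memory, which is roughly the locality advantage a compacting GC gives Java for free:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of a bump-pointer arena (hypothetical, not the project's code).
// Consecutive allocate() calls return adjacent slots inside one block, so
// linked-list nodes end up contiguous instead of scattered by per-node new.
template <typename T>
class Arena {
    static constexpr std::size_t BLOCK = 4096;   // slots per block (arbitrary)
    std::vector<std::unique_ptr<T[]>> blocks_;   // all storage, freed together
    std::size_t used_ = BLOCK;                   // slots consumed in current block
public:
    T* allocate() {
        if (used_ == BLOCK) {                    // current block full: grab a new one
            blocks_.emplace_back(new T[BLOCK]());
            used_ = 0;
        }
        return &blocks_.back()[used_++];
    }
};
```

In the IntMap below, newEntry would call arena.allocate() instead of new Entry&lt;E&gt;(), and the destructor would simply let the arena go out of scope instead of walking the free list deleting nodes one by one.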

It is important to note that during the second put benchmark, the C++ code will not be allocating any Entry on the heap, as the objects will all be available in the internal object pool (from the first put). Therefore I'm not sure why C++ is still slower on the second put benchmark.

FTR, below are our current results for Java and C++:

HotSpotVM JIT: (with Graal JVMCI JIT)

PUT => Avg: 371 ns | Min: 28 ns | 99.9% = [avg: 367 ns, max: 1.743 micros]
PUT => Avg: 613 ns | Min: 27 ns | 99.9% = [avg: 606 ns, max: 2.184 micros]
GET => Avg: 615 ns | Min: 14 ns | 99.9% = [avg: 607 ns, max: 2.549 micros]
DEL => Avg: 662 ns | Min: 18 ns | 99.9% = [avg: 658 ns, max: 2.538 micros]

HotSpotVM JIT: (with C2 JIT)

PUT => Avg: 342 ns | Min: 29 ns | 99.9% = [avg: 338 ns, max: 1.661 micros]
PUT => Avg: 596 ns | Min: 28 ns | 99.9% = [avg: 589 ns, max: 2.161 micros]
GET => Avg: 599 ns | Min: 20 ns | 99.9% = [avg: 592 ns, max: 2.275 micros]
DEL => Avg: 826 ns | Min: 23 ns | 99.9% = [avg: 817 ns, max: 3.420 micros]

C++ LLVM: (clang)

PUT => Avg: 726 ns | Min: 30 ns | 99.9% = [avg: 720 ns, max: 4.097 micros]
PUT => Avg: 857 ns | Min: 18 ns | 99.9% = [avg: 848 ns, max: 2.933 micros]
GET => Avg: 874 ns | Min: 18 ns | 99.9% = [avg: 865 ns, max: 3.010 micros]
DEL => Avg: 875 ns | Min: 19 ns | 99.9% = [avg: 871 ns, max: 2.810 micros]

GraalVM: (native-image)

PUT => Avg: 190 ns | Min: 21 ns | 99.9% = [avg: 183 ns, max: 814 ns]
PUT => Avg: 659 ns | Min: 23 ns | 99.9% = [avg: 656 ns, max: 2.762 micros]
GET => Avg: 399 ns | Min: 21 ns | 99.9% = [avg: 396 ns, max: 2.124 micros]
DEL => Avg: 323 ns | Min: 27 ns | 99.9% = [avg: 321 ns, max: 1.850 micros]

Benchmark Environment:

$ uname -a
Linux hivelocity 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue | head -n 1
Ubuntu 18.04.6 LTS \n \l

$ cat /proc/cpuinfo | grep "model name" | head -n 1 | awk -F ": " '{print $NF}'
Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz

$ arch
x86_64

$ clang++ --version
Ubuntu clang version 18.1.0 (++20240220094926+390dcd4cbbf5-1~exp1~20240220214944.50)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ java -version
java version "23.0.1" 2024-10-15
Java(TM) SE Runtime Environment Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11-jvmci-b01)
Java HotSpot(TM) 64-Bit Server VM Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11-jvmci-b01, mixed mode, sharing)

$ native-image --version
native-image 23.0.1 2024-10-15
GraalVM Runtime Environment Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11-jvmci-b01)
Substrate VM Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11, serial gc, compressed references)

To compile the C++ code:

rm -f target/cpp/int_map_benchmark target/cpp/int_map.o target/cpp/bench.o target/cpp/int_map_benchmark.o

mkdir -p target/cpp

clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/int_map.cpp -o ./target/cpp/int_map.o
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/bench.cpp -o ./target/cpp/bench.o
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/int_map_benchmark.cpp -o ./target/cpp/int_map_benchmark.o

clang++ -Ofast -march=native -flto -std=c++17 -o ./target/cpp/int_map_benchmark ./target/cpp/int_map.o ./target/cpp/bench.o ./target/cpp/int_map_benchmark.o

To run the C++ code:

#!/bin/bash

WARMUP=${1:-0}
MEASUREMENTS=${2:-2000000}
CAPACITY=${3:-20000}

./target/cpp/int_map_benchmark $WARMUP $MEASUREMENTS $CAPACITY

Below is the C++ code:

// =====> bench.hpp
#ifndef BENCH_HPP
#define BENCH_HPP

#include <chrono>
#include <iostream>
#include <limits>
#include <iomanip>
#include <string>
#include <cmath>
#include <map>
#include <sstream>

class Bench {
public:
    Bench(int warmupCount = 0);
    ~Bench();

    void mark();
    void measure();
    bool measure(long long);
    void reset();
    void reset(bool);
    void printResults() const;
    void printResults(bool) const;
    bool isWarmingUp() const;
    int getIterations() const;
    int getMeasurements() const;
    double getAverage() const;

private:
    int warmupCount;
    int measurementCount;
    long long sum;
    long long minTime;
    long long maxTime;
    int size;
    std::map<long long, long long>* results;
    std::chrono::steady_clock::time_point startTime;
    
    static std::string formatWithCommas(long long value);
    static std::pair<double, std::string> formatTime(double nanos);
    static std::string formatPercentage(double perc);
    static double roundToDecimals(double d, int decimals);
    void printPercentiles() const;
    void addPercentile(double perc) const;
    double avg() const;
};

#endif // BENCH_HPP

// =====> bench.cpp
#include "bench.hpp"
using namespace std;

Bench::Bench(int warmupCount)
    : warmupCount(warmupCount),
      measurementCount(0),
      sum(0),
      minTime(numeric_limits<long long>::max()),
      maxTime(numeric_limits<long long>::min()),
      size(0) {

        results = new map<long long, long long>();

}

Bench::~Bench() {
    delete results;
}

void Bench::mark() {
    startTime = chrono::steady_clock::now();
}

void Bench::measure() {
    auto endTime = chrono::steady_clock::now();
    auto elapsed = chrono::duration_cast<chrono::nanoseconds>(endTime - startTime).count();
    measure(elapsed);
}

bool Bench::measure(long long elapsed) {

    bool isToMeasure = ++measurementCount > warmupCount;

    if (isToMeasure) {
        sum += elapsed;
        if (elapsed < minTime) minTime = elapsed;
        if (elapsed > maxTime) maxTime = elapsed;

        // Increment the frequency of this elapsed time
        auto it = results->find(elapsed);
        if (it == results->end()) {
            results->insert({elapsed, 1});
        } else {
            it->second++;
        }
        size++;
    }
    
    return isToMeasure;
}

int Bench::getIterations() const {
    return measurementCount;
}

int Bench::getMeasurements() const {
    return size;
}

void Bench::reset() {
    reset(false);
}

void Bench::reset(bool repeatWarmup) {
    measurementCount = 0;
    sum = 0;
    if (!repeatWarmup) warmupCount = 0;
    minTime = numeric_limits<long long>::max();
    maxTime = numeric_limits<long long>::min();
    results->clear();
    size = 0;
}

bool Bench::isWarmingUp() const {
    return measurementCount < warmupCount; // true while still inside the warm-up iterations
}

double Bench::avg() const {
    const int effectiveCount = measurementCount - warmupCount;
    if (effectiveCount <= 0) {
        return 0;
    }
    const double avg = static_cast<double>(sum) / effectiveCount;
    const double rounded = round(avg * 100.0) / 100.0;
    return rounded;
}

double Bench::getAverage() const {
    return avg();
}   

void Bench::printResults() const {
    printResults(true);
}

void Bench::printResults(bool includePercentiles) const {

    int effectiveCount = measurementCount - warmupCount;

    string effCountStr = formatWithCommas(effectiveCount);
    string warmupStr = formatWithCommas(warmupCount);
    string totalStr = formatWithCommas(measurementCount);

    cout << "Measurements: " << effCountStr
         << " | Warm-Up: " << warmupStr
         << " | Iterations: " << totalStr << endl;
         
    if (effectiveCount > 0) {

        auto [avgVal, avgUnit] = formatTime(avg());
        auto [minVal, minUnit] = formatTime(static_cast<double>(minTime));
        auto [maxVal, maxUnit] = formatTime(static_cast<double>(maxTime));
    
        cout << fixed << setprecision(3);
        cout << "Avg Time: " << avgVal << " " << avgUnit << " | "
             << "Min Time: " << minVal << " " << minUnit << " | "
             << "Max Time: " << maxVal << " " << maxUnit << endl;
    
        if (includePercentiles) printPercentiles();
    }
    
    cout << endl;
}

string Bench::formatWithCommas(long long value) {
    string numStr = to_string(value);
    int insertPosition = static_cast<int>(numStr.length()) - 3;
    while (insertPosition > 0) {
        numStr.insert(insertPosition, ",");
        insertPosition -= 3;
    }
    return numStr;
}

pair<double, string> Bench::formatTime(double nanos) {
    if (nanos >= 1'000'000'000.0) {
        double seconds = nanos / 1'000'000'000.0;
        return {roundToDecimals(seconds, 3), seconds > 1 ? "seconds" : "second"};
    } else if (nanos >= 1'000'000.0) {
        double millis = nanos / 1'000'000.0;
        return {roundToDecimals(millis, 3), millis > 1 ? "millis" : "milli"};
    } else if (nanos >= 1000.0) {
        double micros = nanos / 1000.0;
        return {roundToDecimals(micros, 3), micros > 1 ? "micros" : "micro"};
    } else {
        double ns = nanos;
        return {roundToDecimals(ns, 3), ns > 1 ? "nanos" : "nano"};
    }
}

double Bench::roundToDecimals(double d, int decimals) {
    double pow10 = pow(10.0, decimals);
    return round(d * pow10) / pow10;
}

void Bench::printPercentiles() const {

    if (size == 0) return;

    double percentiles[] = {0.75, 0.90, 0.99, 0.999, 0.9999, 0.99999};

    for (double p : percentiles) {
        addPercentile(p);
    }
}

string Bench::formatPercentage(double perc) {
    double p = perc * 100.0;

    ostringstream oss;
    oss << fixed << setprecision(6) << p;

    string s = oss.str();
    // remove trailing zeros
    while (s.back() == '0') {
        s.pop_back();
    }

    // if the last character is now a '.', remove it
    if (s.back() == '.') {
        s.pop_back();
    }

    // Append the '%' sign
    s += "%";

    return s;
}

void Bench::addPercentile(double perc) const {

    if (results->empty()) return;

    long long target = static_cast<long long>(llround(perc * size));
    if (target == 0) return;
    if (target > size) target = size;

    // Iterate through the map to find the top element at position target
    long long iTop = 0;
    long long sumTop = 0;
    long long maxTop = -1;

    for (auto &entry : *results) {
        long long time = entry.first;
        long long count = entry.second;

        for (int i = 0; i < count; i++) {
            iTop++;
            sumTop += time;
            if (iTop == target) {
                maxTop = time;
                goto FOUND;
            }
        }
    }

FOUND:;

    double avgTop = static_cast<double>(sumTop) / iTop;
    auto [avgVal, avgUnit] = formatTime(avgTop);
    auto [maxVal, maxUnit] = formatTime(static_cast<double>(maxTop));

    cout << fixed << setprecision(3);
    cout << formatPercentage(perc) << " = [avg: " << avgVal << " " << avgUnit
         << ", max: " << maxVal << " " << maxUnit << "]\n";
}

// =====> int_map.hpp
#ifndef INT_MAP_HPP
#define INT_MAP_HPP

#include <optional>
#include <cstddef>

template <typename E>
class IntMap {

private:

    template <typename T>
    struct Entry {
        int key;
        T* value; // removed unnecessary std::optional (thanks to Paul McKenzie at https://stackoverflow.com/users/3133316/paulmckenzie)
        Entry<T>* next;
    };

    Entry<E>** data;
    int count;
    Entry<E>* head;
    const int capacity;

    Entry<E>* newEntry(int key, const E& value, Entry<E>* next) {
        Entry<E>* newEntry = head;

        if (newEntry != nullptr) {
            head = newEntry->next;
        } else {
            newEntry = new Entry<E>();
        }

        newEntry->key = key;
        newEntry->value = const_cast<E*>(&value);
        newEntry->next = next;

        return newEntry;
    }

    void free(Entry<E>* e) {
        e->value = nullptr;
        e->next = head;
        head = e;
    }

    int toArrayIndex(int key) const {
        return (key & 0x7FFFFFFF) % capacity;
    }

public:

    IntMap(int capacity)
        : capacity(capacity), count(0), head(nullptr) {
        data = new Entry<E>*[capacity];
        for (int i = 0; i < capacity; i++) {
            data[i] = nullptr;
        }
    }

    ~IntMap() {
        clear();

        while (head != nullptr) {
            Entry<E>* temp = head;
            head = head->next;
            delete temp;
        }

        delete[] data;
    }

    int size() const {
        return count;
    }

    E* get(int key) const {
        Entry<E>* e = data[toArrayIndex(key)];
        while (e != nullptr) {
            if (e->key == key) {
                return e->value;
            }
            e = e->next;
        }
        return nullptr;
    }

    E* put(int key, const E& value) {
        int index = toArrayIndex(key);
        Entry<E>* e = data[index];

        while (e != nullptr) {
            if (e->key == key) {
                E* old = e->value;
                e->value = const_cast<E*>(&value);
                return old;
            }
            e = e->next;
        }

        data[index] = newEntry(key, value, data[index]);
        count++;
        return nullptr;
    }

    E* remove(int key) {
        int index = toArrayIndex(key);
        Entry<E>* e = data[index];
        Entry<E>* prev = nullptr;

        while (e != nullptr) {
            if (e->key == key) {
                if (prev != nullptr) {
                    prev->next = e->next;
                } else {
                    data[index] = e->next;
                }

                E* oldValue = e->value;
                free(e);
                count--;
                return oldValue;
            }
            prev = e;
            e = e->next;
        }
        return nullptr;
    }

    void clear() {
        for (int index = capacity - 1; index >= 0; index--) {
            while (data[index] != nullptr) {
                Entry<E>* next = data[index]->next;
                free(data[index]);
                data[index] = next;
            }
        }
        count = 0;
    }
};

#endif // INT_MAP_HPP

// =====> int_map.cpp
#include "int_map.hpp"

// NOOP

// =====> int_map_benchmark.cpp
#include "int_map.hpp"
#include "bench.hpp"

using namespace std;

struct Dummy {};

int main(int argc, char* argv[]) {

    int warmupCount = 1000000;
    int measureCount = 1000000;
    int capacity = 100000;

    if (argc > 1) {
        warmupCount = atoi(argv[1]);
    }
    if (argc > 2) {
        measureCount = atoi(argv[2]);
    }
    if (argc > 3) {
        capacity = atoi(argv[3]);
    }

    cout << "\nArguments: warmup=" << warmupCount << " measurements=" << measureCount << " mapCapacity=" << capacity << endl << endl;

    int iterations = warmupCount + measureCount;

    IntMap<Dummy>* map = new IntMap<Dummy>(capacity);
    Bench bench(warmupCount);
    Dummy dummy;

    cout << "Benchmarking put on empty map... (1) => creating new Entry objects" << endl;
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        map->put(i, dummy);
        bench.measure();
    }
    bench.printResults();
    
    cout << "Benchmarking put after clear()... (2) => hitting the pool of Entry objects" << endl;
    map->clear(); // clear the map, but the entry pool will be there
    bench.reset(true);
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        map->put(i, dummy);
        bench.measure();
    }
    bench.printResults();

    cout << "Benchmarking get..." << endl;
    bench.reset(true);
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        volatile auto val = map->get(i); // volatile to prevent optimization out
        bench.measure();
    }
    bench.printResults();

    cout << "Benchmarking remove..." << endl;
    bench.reset(true);
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        map->remove(i);
        bench.measure();
    }
    bench.printResults();

    delete map; 
    return 0;
}

Solution

  • Your code reads like Java converted to C then translated to C++. It really isn't idiomatic C++. Your use of volatile blocks an unknown set of optimizations. You are storing pointers to data inside your table, instead of actual data, for no good reason as far as I can tell, probably because Java uses reference semantics by default.

    Part of what makes C++ faster is that it embraces values. GC languages push you towards using references (pointers) instead of values, and pointers have performance problems. Because your code is Java-based, your code actively avoids having values.

    template <typename E>
    struct IntMap {
      struct Entry {
        std::size_t key;
        std::optional<E> value;
      };
      std::size_t m_count = 0;
      std::vector<std::vector<Entry>> m_data;
    
      std::size_t toArrayIndex(std::size_t key) const {
        return key % m_data.size();
      }
    
      IntMap(std::size_t capacity)
        : m_data(capacity)
      {}
    
      std::size_t size() const {
        return m_count;
      }
    
      std::optional<E> get(std::size_t key) const {
        std::vector<Entry> const& es = m_data[toArrayIndex(key)];
        for (Entry const& e:es) {
          if (e.key == key)
            return e.value;
        }
        return std::nullopt;
      }
    
      std::optional<E> put(std::size_t key, const E& value) {
        std::vector<Entry>& es = m_data[toArrayIndex(key)];
        for (auto& e:es) {
          if (e.key == key) {
            auto old = std::move(e.value);
            e.value = value;
            return old;
          }
        }
        es.push_back( {key, value } );
        m_count++;
        return std::nullopt;
      }
    
      std::optional<E> remove(std::size_t key) {
        std::vector<Entry>& es = m_data[toArrayIndex(key)];
        for (Entry& e:es) {
          if (e.key == key) {
            auto old = e.value;
            std::swap( e, es.back() );
            es.resize( es.size()-1 );
            return old;
          }
        }
        return std::nullopt;
      }
    
      void clear() {
        for (auto& es:m_data)
          es.clear();
        m_count = 0;
      }
    };
    

    This is far more idiomatic in C++; I'd probably also replace the get functions returning E* with std::optional<E>.

    It is about 3x to 5x faster than your version (the max time for a remove is actually almost 100x faster than your version, but 3-5 is for average cases).

    Java has the ability to relocate memory; C++ does not. This allows Java's "nearly forced" heap-based objects to be moved into more coherent clumps, reducing cache misses and the like.

    C++ has strong support for value semantics. This lets you put actual data right where you want it. I used value semantics to get rid of linked lists and incoherent heap allocations.

    C++'s implementation is a (dynamic) array of (dynamic) arrays of key-value pairs, with the outer array never changing size. To reach elements, you have to traverse 2 pointers. If I was being clever I'd get rid of the double pointer indirection and drop it down to 1+epsilon.

    Your Java based implementation is a (dynamic) array of pointers to linked lists of pointers to elements. To reach elements (which your benchmark does not do) you have to traverse 3+(N/Capacity) pointers.

    Now, this is all sort of silly, because a good quality hash system usually avoids mass bucket clumps by having more buckets, instead of stuffing things in over-bursting buckets.
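To make that concrete, here is a minimal growth policy (a keys-only sketch; the class name and the 0.75 threshold are illustrative, not taken from the answer's code above) that rehashes into a bigger table before buckets can clump, so chains never get anywhere near 100 entries:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: grow the bucket array whenever the load factor would
// exceed ~0.75 -- the standard alternative to letting chains reach ~100.
class GrowingBuckets {
    std::vector<std::vector<int>> buckets_;  // keys only, for illustration
    std::size_t count_ = 0;
public:
    explicit GrowingBuckets(std::size_t cap) : buckets_(cap) {}

    void insert(int key) {
        if (count_ + 1 > buckets_.size() * 3 / 4)
            rehash(buckets_.size() * 2);     // double the table before it clumps
        buckets_[std::size_t(key) % buckets_.size()].push_back(key);
        ++count_;
    }

    double load_factor() const {
        return double(count_) / double(buckets_.size());
    }

private:
    void rehash(std::size_t new_cap) {
        std::vector<std::vector<int>> next(new_cap);
        for (auto& b : buckets_)
            for (int k : b)
                next[std::size_t(k) % new_cap].push_back(k);
        buckets_.swap(next);
    }
};
```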

    Live example - change #if 0 to #if 1 to toggle which version is active.