Tags: c, optimization, hashmap, data-oriented-design

Is it OK to add negligible risks of errors for performance gains? (fixed hash map bucket size)


I have recently been making a custom hash map for a project in C.

My hash map currently works like this:
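In rough terms, it looks something like the sketch below (simplified, with purely illustrative names and key/value types): one allocation for the bucket array at creation time, plus one allocation per bucket for its growable entry storage.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t key;      /* illustrative key/value types */
    void    *value;
} entry;

typedef struct {
    entry  *entries;   /* separately heap-allocated, grows on demand */
    size_t  count;
    size_t  capacity;
} bucket;

typedef struct {
    bucket *buckets;
    size_t  bucket_count;
} hashmap;

/* Creation performs one heap allocation per bucket (plus the map and the
 * bucket array themselves), which is what makes building millions of maps
 * expensive. */
hashmap *hashmap_create(size_t bucket_count, size_t initial_capacity)
{
    hashmap *map = malloc(sizeof *map);
    if (!map)
        return NULL;
    map->bucket_count = bucket_count;
    map->buckets = calloc(bucket_count, sizeof *map->buckets);
    if (!map->buckets) {
        free(map);
        return NULL;
    }
    for (size_t i = 0; i < bucket_count; i++) {
        map->buckets[i].entries  = malloc(initial_capacity * sizeof(entry));
        map->buckets[i].count    = 0;
        map->buckets[i].capacity = initial_capacity;
        /* per-bucket error handling omitted for brevity */
    }
    return map;
}
```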

One problem I currently have with this hash map is that creating it is far too slow for my use case (I have to create a lot of these maps, on the order of millions).

A simple way to improve creation speed would be to allocate each bucket only when it is first needed, but that would essentially just delay the allocations, so the overall gain would be minimal.

Another idea that came to mind was to give each bucket a fixed-size array of items. That way I would only have to make one large heap allocation per map if done right. However, there would be a slight risk of a non-resolvable overflow of a bucket's item array, especially with a small capacity. I have calculated that, starting with 32 buckets and a fixed capacity of 18 items per bucket, that probability would be around 10^(-19) (if my math is correct), and with more buckets it would become even smaller. The possibility of that error occurring would therefore be essentially negligible.
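Concretely, the idea is something like the sketch below (again simplified, with illustrative names; 32 buckets and 18 slots per bucket just mirror the numbers above, and a real map would eventually rehash into a larger table). The whole table, per-bucket counts included, fits in a single heap allocation, and creation is essentially one calloc:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BUCKET_COUNT    32   /* matches the example above */
#define BUCKET_CAPACITY 18

typedef struct {
    uint64_t key;      /* illustrative key/value types */
    void    *value;
} entry;

typedef struct {
    uint8_t count;                    /* used slots in this bucket */
    entry   slots[BUCKET_CAPACITY];   /* fixed-size, no per-bucket allocation */
} fixed_bucket;

typedef struct {
    fixed_bucket buckets[BUCKET_COUNT];
} fixed_hashmap;

/* One allocation per map; counts start at zero thanks to calloc. */
fixed_hashmap *fixed_hashmap_create(void)
{
    return calloc(1, sizeof(fixed_hashmap));
}

/* Returns 0 on success, -1 if the target bucket is already full, i.e. the
 * "non-resolvable overflow" case. The key is assumed to be pre-hashed. */
int fixed_hashmap_insert(fixed_hashmap *map, uint64_t key, void *value)
{
    fixed_bucket *b = &map->buckets[key % BUCKET_COUNT];
    if (b->count == BUCKET_CAPACITY)
        return -1;
    b->slots[b->count].key   = key;
    b->slots[b->count].value = value;
    b->count++;
    return 0;
}
```

Checking count before writing keeps the overflow case a detectable error instead of a silent write past the end of the array. (For what it's worth, the overflow probability depends on how many items end up in the map before it is resized; with n items spread uniformly over b buckets of capacity c, a union bound over the buckets gives roughly b · P(Binomial(n, 1/b) > c), which is the kind of calculation behind a figure like the one above.)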

Especially in the spirit of data-oriented design, I find this kind of hash map quite interesting, but I couldn't find anything on whether ignoring negligible risks of errors is a practice that can, is, or should ever be used. I would really like to know whether this is a known practice anywhere and whether it shows up outside of hash maps.


Solution

  • As always, the answer is "it depends". Thinking about this problem in the most general way possible, there are definitely real, well-established use cases where results are purposely incorrect for the sake of performance (think of distributed systems, with eventual consistency as one example). However, any serious system that relies on such techniques comes with an in-depth analysis of the trade-offs (mathematical proofs, market analysis, etc.) and a clear understanding and acceptance of those trade-offs.

    Going back to your example, let's change the problem a bit for the sake of simplicity. Say that in the event of an error the user gets an error message and can easily restart the application. That seems fairly reasonable. But is this a trade-off you are willing to make for the performance benefits? Only you can answer.

    Now, your real use case is muddier, because when the array overflows your program has undefined behavior. You can't know how the program will behave then; you can't even know when, or whether, the program is wrong. Anything can happen: the program can appear to work, crash, or start giving strange results. Is that an acceptable trade-off? Again, only you can answer.

    But my two cents is that you should aim first and foremost for a correct program. Undefined behavior is really something you don't want (besides buggy behavior, it is also a security problem). There are definitely ways to speed up allocations, e.g. by using memory pools, taking care of fragmentation, etc. (see the rough sketch after this answer), and there are other kinds of hash maps that are better suited to particular use cases. The subject is well studied. I would suggest you explore those options, and I strongly suggest you don't purposely add undefined behavior to your program.
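For instance, a rough sketch of the pool idea (illustrative only, not a drop-in implementation): carve the maps' storage out of one big arena, so that creating a map no longer hits malloc at all and millions of maps can be released in one go.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal bump-allocator arena: one big malloc up front, then each
 * "allocation" is just a pointer increment. Memory is released all at
 * once by resetting or destroying the arena. */
typedef struct {
    char  *memory;
    size_t used;
    size_t capacity;
} arena;

int arena_init(arena *a, size_t capacity)
{
    a->memory = malloc(capacity);
    a->used = 0;
    a->capacity = capacity;
    return a->memory ? 0 : -1;
}

void *arena_alloc(arena *a, size_t size)
{
    /* Round up so every allocation stays pointer-aligned. */
    size_t aligned = (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    if (a->used + aligned > a->capacity)
        return NULL;                 /* out of arena space */
    void *p = a->memory + a->used;
    a->used += aligned;
    return p;
}

void arena_reset(arena *a)   { a->used = 0; }
void arena_destroy(arena *a) { free(a->memory); a->memory = NULL; }
```

Creating a map then becomes an arena_alloc call instead of one or more malloc calls, and tearing everything down is a single arena_reset or arena_destroy, which can remove most of the per-map allocation overhead without giving up a correct, bounds-checked insert path.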