Given some interval [a, b] of indices (64-bit unsigned integers), I would like to quickly obtain an array that contains all of these indices, ordered according to a uniformly distributed hash function: the order should appear random but be identical on every system, regardless of the C++ implementation used.
The goal is to find a highly optimized method of this kind. You may use shared-memory parallelism via Intel's oneTBB to improve performance.
Something like
vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    unordered_set<uint64_t> uset;
    for (uint64_t i = from; i <= to; i++)
        uset.insert(i);
    return vector<uint64_t>(uset.begin(), uset.end());
}
would generate the desired results if unordered_set<uint64_t>
always used the same, pseudo-randomly distributing hash function on every implementation, neither of which is the case. It would also be an inefficient solution. TBB equivalent:
tbb::concurrent_vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    tbb::concurrent_unordered_set<uint64_t> uset;
    tbb::parallel_for(from, to + 1, [&uset](uint64_t i) {
        uset.insert(i);
    }); // NOTE: This is only to illustrate a parallel loop; sequential insertion is actually faster.
    return tbb::concurrent_vector<uint64_t>(uset.begin(), uset.end());
}
Note that distributeIndices(from, to)
should return a random-looking permutation of {from, ..., to}.
Merely providing some hash functions is insufficient, and none of the answers at "Generating a deterministic int from another with no duplicates" actually answered that question.
Consider transform from this answer. Notably, a cyclic distribution is not a pseudo-random distribution. With n := to - from + 1, the obvious ways to use it fail:

- Sorting {from, ..., to} with a comparator like (uint64_t a, uint64_t b) { return transform(a) < transform(b) } or (uint64_t a, uint64_t b) { return transform(a) % n < transform(b) % n } does not yield a pseudo-random order.
- Mapping x in {from, ..., to} to transform(x - from) % n + from: distributeIndices(42, 42+99999999) happens to be bijective (since 100000000 and 39293 are coprime), but distributeIndices(42, 42+99999999)[0, ..., 999] looks not random at all: [plot] And distributeIndices(42, 42+3929299) is not bijective; it assigns only 100 different elements, cycling with a period of 100: [plot]
- Mapping x in {from, ..., to} to transform(x - from) + from: distributeIndices(42, 42+99999999) is not bijective, e.g. it assigns 3929375282657 > 42+99999999.

In particular, a linear congruential generator is not in general a bijection. But if you can make it one for each interval [from, to], while also hiding its cyclic nature, how?
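To make the second mapping concrete, here is a small sketch. The transform below is only a stand-in with the multiplier 39293 that drives the described behaviour; the linked answer's exact function is not reproduced here.

uint64_t transform(uint64_t x) { return x * 39293; } // hypothetical stand-in, not the linked answer's exact code

// The second mapping discussed above: x in {from, ..., to} is sent to
// transform(x - from) % n + from. It is bijective only when the multiplier
// and n = to - from + 1 are coprime, and it is visibly cyclic either way.
std::vector<uint64_t> distributeIndicesCyclic(uint64_t from, uint64_t to) {
    uint64_t n = to - from + 1;
    std::vector<uint64_t> result(n);
    for (uint64_t x = 0; x < n; ++x)
        result[x] = transform(x) % n + from;
    return result;
}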
Consequently, an answer should provide a specific hash function (and why it is fast and uniformly distributed), and how to efficiently utilize it in order to compute distributeIndices(from, to).
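For concreteness only (this is an illustration, not a proposed answer): a fully specified mixer in the spirit of splitmix64 is the kind of hash meant here, since it uses nothing but wrap-around uint64_t arithmetic and therefore yields identical values on every conforming implementation.

#include <cstdint>

// splitmix64-style finalizer, shown only to illustrate what "a specific,
// fully specified hash function" means; unsigned 64-bit arithmetic wraps
// around by definition, so results are identical on every implementation.
uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ull;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
    return x ^ (x >> 31);
}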
Again, it is crucial that distributeIndices(from, to) has the same result regardless of where it runs and which compiler was used; this must be guaranteed by the C++ standard. But it is fine if, for example, distributeIndices(0,2) assigns a different index to 1 than distributeIndices(0,3) does.
Acceptable return types are std::vector, tbb::concurrent_vector, and dynamic arrays, consisting of elements of type uint64_t.
The function should perform well on ranges that include billions of indices.
[ In case you are curious why this could be useful: consider different processes on different computing nodes, communicating via the Message Passing Interface, which should not send the actual data (which is big), but only the indices of the data entries they are processing. At the same time, the order in which data is processed should be pseudo-randomized so that the progress speed does not "bounce" much (which it does when processing along the ordered indices). This is essential for reliable predictions of how long the overall computation will take. So every node must know which transformed index refers to which actual index, i.e. every node must compute the same result for distributeIndices(from, to). ]
The fastest correctly working solution wins the accepted answer.
I will test solutions with GCC 11.3 -O3 on my old i7-3610QM laptop with 8 hardware threads, on 100 million indices (i.e. distributeIndices(c, c + 99999999)), and may change the accepted answer if a future answer provides a better-performing solution.
Testing code (run up to 10 times, pick fastest execution):
int main(int argc, char* argv[]) {
    uint64_t c = argc < 3 ? 42 : atoll(argv[1]);
    uint64_t s = argc < 3 ? 99999 : atoll(argv[2]); // 99999999 for performance testing
    for (unsigned run = 0; run < 10; run++) {
        chrono::time_point<chrono::system_clock> startTime = chrono::system_clock::now();
        auto indices = distributeIndices(c, c + s);
        chrono::microseconds dur = chrono::duration_cast<chrono::microseconds>(chrono::system_clock::now() - startTime);
        cout << durationStringMs(dur) << endl; // durationStringMs and round(value, digits) are formatting helpers (see sketch below)
        // [... some checks ...]
#if 0 // bijectivity check
        set<uint64_t> m = set<uint64_t>(indices.begin(), indices.end());
        cout << "min: " << *m.begin() << " , max: " << *prev(m.end()) << ", #elements: " << m.size() << endl;
#endif
        cout << "required average: " << round((2.0L * c + s) / 2, 2) << endl;
        long double avg = accumulate(indices.begin(), indices.end(), __uint128_t(0)) / static_cast<long double>(indices.size());
        string sa = round(avg, 2);
        cout << "actual average: " << sa << endl;
        auto printTrendlineHelpers = [](uint64_t minX, string avgX, uint64_t maxX, uint64_t minY, string avgY, uint64_t maxY) {
            cout << "Trendline helpers:" << endl;
            cout << "[max] " << minX << " " << maxY << " " << avgX << " " << maxY << " " << maxX << " " << maxY << endl;
            cout << "[avg] " << minX << " " << avgY << " " << avgX << " " << avgY << " " << maxX << " " << avgY << endl;
            cout << "[min] " << minX << " " << minY << " " << avgX << " " << minY << " " << maxX << " " << minY << endl;
        };
        // Print some plottable data, for e.g. https://www.rapidtables.com/tools/scatter-plot.html
        unsigned plotAmount = 2000;
        auto printPlotData = [&](uint64_t start, uint64_t end) {
            long double rng = static_cast<long double>(end - start);
            long double avg = accumulate(indices.begin() + start, indices.begin() + end, __uint128_t(0)) / rng;
            cout << "\ndistributeIndices(" << c << ", " << c << "+" << s << ")[" << start << ", ..., " << end - 1 << "]: (average " << round(avg, 2) << ")" << endl;
            stringstream ss;
            for (uint64_t i = start; i < end; i++)
                ss << i << " " << indices[i] << (i + 1 == end ? "" : " ");
            cout << ss.str() << endl;
            printTrendlineHelpers(start, round(start + rng / 2, 2), end - 1, c, sa, c + s);
        };
        printPlotData(0, plotAmount); // front
        printPlotData(indices.size() / 2 - plotAmount / 2, indices.size() / 2 + plotAmount / 2); // middle
        printPlotData(indices.size() - plotAmount, indices.size()); // back
#if 1 // Print average course
        if (s >= 1000000)
            plotAmount *= 10;
        stringstream ss;
        for (uint64_t start = 0; start < indices.size(); start += plotAmount) {
            uint64_t end = min(start + plotAmount, indices.size());
            uint64_t i = start + (end - start) / 2;
            long double avg = accumulate(indices.begin() + start, indices.begin() + end, __uint128_t(0)) / static_cast<long double>(end - start);
            ss << i << " " << round(avg, 2) << (end == indices.size() ? "" : " ");
        }
        cout << "\nAverage course of distributeIndices(" << c << ", " << c << "+" << s << ") with slices of size " << plotAmount << ":\n" << ss.str() << endl;
        printTrendlineHelpers(c, sa, c + s, c, sa, c + s);
        break;
#endif
    }
    return 0;
}
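The helpers round(value, digits) and durationStringMs are not shown above; plausible stand-ins (purely hypothetical, the exact output format may differ) would be:

// Hypothetical stand-ins for the helpers used by the test driver above;
// the actual implementations are not shown in the question.
#include <chrono>
#include <iomanip>
#include <sstream>
#include <string>

std::string round(long double value, unsigned digits) {
    std::ostringstream oss;
    oss << std::fixed << std::setprecision(digits) << value;
    return oss.str();
}

std::string durationStringMs(const std::chrono::microseconds& dur) {
    std::ostringstream oss;
    oss << std::fixed << std::setprecision(2) << dur.count() / 1000.0L << " ms";
    return oss.str();
}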
Note that uint64_t from and uint64_t to cannot be considered constexpr. My two (unsuitable) examples above take 14482.83 ms (14 s 482.83 ms) and 186812.68 ms (3 min 6 s 812.68 ms), respectively.
The second approach seems terribly slow, but upon closer inspection it is the only one that on my system actually distributes the values:

unordered_set<uint64_t> variant: [plot]
tbb::concurrent_unordered_set<uint64_t> variant: [plot]

An exemplary distribution suggesting randomness can be obtained from olegarch's answer, as of Apr 30, 2023.
// LCG params from: https://nuclear.llnl.gov/CNP/rng/rngman/node4.html
std::vector<uint64_t> distributeIndices(uint64_t lo, uint64_t hi) {
    uint64_t size = hi - lo + 1;
    std::vector<uint64_t> vec(size);
    for (uint64_t i = 0; i < size; i++)
        vec[i] = i + lo;
    uint64_t rnd = size ^ 0xBabeCafeFeedDad;
    for (uint64_t i = 0; i < size; i++) {
        rnd = rnd * 2862933555777941757ULL + 3037000493;
        uint64_t j = rnd % size;
        std::swap(vec[i], vec[j]);
    }
    return vec; // no std::move: it would only inhibit copy elision
}
Note that this solution is still incorrect, since it does not provide uniform distributions for all ranges, as shown further below. It also does not utilize parallel computing, but it performs well: computing 100 million indices took 3235.18 ms on my i7-3610QM.
distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline: [plot]
distributeIndices(42, 42+99999999)[49999000, ..., 50000999] with polynomial trendline: [plot]
distributeIndices(42, 42+99999999)[99998000, ..., 99999999] with polynomial trendline: [plot]
distributeIndices(42, 42+99999999) with polynomial trendline: [plot]
distributeIndices(0, 67108863) with polynomial trendline: [plot]
distributeIndices(0, 67108863)[0, ..., 1999] with polynomial trendline: [plot]

An exemplary distribution with a flawless trendline can be obtained from Severin Pappadeux's answer, as of Apr 30, 2023. Following its suggestions, I added some parallelization.
uint64_t m = 0xd1342543de82ef95ULL; // taken from https://arxiv.org/pdf/2001.05304.pdf
uint64_t c = 0x1ULL;

inline auto lcg(uint64_t xi) -> uint64_t { // as LCG as it gets
    return m*xi + c;
}

inline auto cmp_lcg(uint64_t a, uint64_t b) -> bool {
    return lcg(a) < lcg(b);
}

auto distributeIndices(uint64_t from, uint64_t to) -> std::vector<uint64_t> {
    uint64_t size = to - from + 1;
    std::vector<uint64_t> z(size);
    tbb::parallel_for(uint64_t(0), size, [&](uint64_t i) {
        z[i] = from + i;
    }); // instead of std::iota(z.begin(), z.end(), from);
    tbb::parallel_sort(z.begin(), z.end(), cmp_lcg); // instead of std::sort(z.begin(), z.end(), cmp_lcg);
    return z;
}
To give an idea of the performance boost via multithreading: computing 100 million indices on my i7-3610QM took 15925.91 ms sequentially and 3666.21 ms with parallelization (on 8 hardware threads). On a computing cluster with Intel Xeon Platinum 8160 processors, I measured the following (#cpu, duration[ms]) results: (1, 19174.65), (2, 9862.29), (4, 5580.47), (8, 3402.05), (12, 2119.28), (24, 1606.78), and (48, 1330.20).
It should also be noted that the code is better optimized and runs much faster when cmp_lcg is turned into a lambda, e.g. auto cmp_lcg = [](uint64_t a, uint64_t b) -> bool { return lcg(a) < lcg(b); };. This way, it performed best at 2608.15 ms on my i7-3610QM. Slightly better performance still can be reached by declaring the global variables m and c as constexpr, or by making them local or literals, which led to a duration of 2542.14 ms.
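For reference, a sketch of that faster variant (constexpr constants, comparator as a lambda); the timings above refer to code along these lines:

#include <cstdint>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/parallel_sort.h>

// Sketch of the optimized variant described above: constexpr LCG constants
// and a comparator lambda, otherwise identical to Severin Pappadeux's approach.
auto distributeIndices(uint64_t from, uint64_t to) -> std::vector<uint64_t> {
    constexpr uint64_t m = 0xd1342543de82ef95ULL; // https://arxiv.org/pdf/2001.05304.pdf
    constexpr uint64_t c = 0x1ULL;
    auto cmp_lcg = [](uint64_t a, uint64_t b) -> bool {
        return m * a + c < m * b + c;
    };
    uint64_t size = to - from + 1;
    std::vector<uint64_t> z(size);
    tbb::parallel_for(uint64_t(0), size, [&](uint64_t i) { z[i] = from + i; });
    tbb::parallel_sort(z.begin(), z.end(), cmp_lcg);
    return z;
}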
distributeIndices(42, 42+99999999) with polynomial trendline: [plot]
distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline: [plot]

After all this, it should be clear what it means that the task is to combine speed with a uniform, bijective pseudo-random distribution.
I am very curious if there even exist any correct and well-performing solutions to the problem!
A negative answer to this question, with a corresponding proof, would of course also be acceptable.
Even better than distributeIndices(uint64_t, uint64_t) -> vector<uint64_t> would be an approach that does not construct a vector at all but merely iterates the indices in pseudo-random order. That would require each pseudo-random index to be efficiently computable from its actual index (without iterating over all the elements before it). I would be surprised if that is possible, but I would gladly be surprised. Such approaches are always considered better than the vector-constructing ones, and are compared amongst each other by the duration of iterating 100 million indices.
The following solution constructs a bijective function F that maps a range of integers onto itself. This function can be used to compute a pseudo-random index directly from a source index such that the resulting pseudo-random indices are a permutation of the source indices.
There are three ideas (all borrowed from cryptography) that taken together allow the construction of such a function: 1) a Pseudo-Random Function Family (PRF), 2) a Feistel network, and 3) format-preserving encryption (FPE). While these ideas draw on well-studied cryptographic concepts, I believe the end product is probably unique and should definitely be considered insecure.
The basic strategy is to encrypt the source index to produce the
target index. The secret sauce is to design the encryption to be
bijective and use a range of integers as the domain. I have dubbed
this method feisty
for the use of a Feistel network.
The first step in the construction is creating a PRF that returns a
pseudo-random value given a 64-bit input. We can create this family
using a single function that also takes a subkey parameter that is
used to select the particular function to use. The canonical PRF
example uses AES to produce a 128-bit pseudo-random value. We will
use the following function which is more efficient to evaluate
(although much less secure) and produces a 64-bit pseudo-random
value. The s0 parameter is the source index and the s1 parameter is the subkey.
uint64_t pseudo_random_function(uint64_t s0, uint64_t s1) {
    auto a = s0 + s1;
    a ^= a >> 12;
    a ^= a << 25;
    a ^= a >> 27;
    return a * 0x2545f4914f6cdd1dull;
}
This function can be used directly to order the source indices, producing a pseudo-random permutation as in Severin Pappadeux's answer, which is equivalent to constructing an FPE using a prefix cipher. The main difference is that this PRF produces more "random"-looking results than the linear congruential generator, as shown in the following plot.
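For illustration, using the PRF directly as a sort key (with an arbitrary fixed subkey, an assumption made only for this sketch) could look like this; the driver further down does something similar via its iterate_prf helper.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative sketch: the PRF as a sort key, analogous to a prefix cipher.
// This is the alternative being compared against, not the Feistel construction.
std::vector<uint64_t> permuteBySort(uint64_t from, uint64_t to) {
    std::vector<uint64_t> z(to - from + 1);
    std::iota(z.begin(), z.end(), from);
    std::sort(z.begin(), z.end(), [](uint64_t a, uint64_t b) {
        return pseudo_random_function(a, 0) < pseudo_random_function(b, 0);
    });
    return z;
}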
Instead of using the PRF directly, we will apply a Feistel network that uses the PRF as its round function. The two key advantages of the Feistel network are that 1) the operation is guaranteed to be invertible (i.e. bijective) even if the round function is not, and 2) the number of output bits can be chosen to be at most one or two more than the number of input bits, making the encoding range of the network at most four times larger than the input domain. The suggested minimum number of rounds for security applications is three. The following class implements a balanced Feistel network.
template<class PRF>
class FeistelNetwork {
public:
    FeistelNetwork(int number_of_bits, int number_rounds, PRF&& prf)
        : shift_((1 + number_of_bits) / 2)
        , mask_((uint64_t{1} << shift_) - 1)
        , nrounds_(number_rounds)
        , prf_(std::forward<PRF>(prf)) {
    }

    auto encode(uint64_t msg) const {
        auto [left, right] = split(msg);
        for (auto i = 0; i < nrounds_; ++i)
            round(left, right, Rounds[i]);
        return combine(left, right);
    }

    auto decode(uint64_t msg) const {
        auto [left, right] = split(msg);
        for (int i = nrounds_ - 1; i >= 0; --i)
            round(right, left, Rounds[i]);
        return combine(left, right);
    }

private:
    std::tuple<uint64_t, uint64_t> split(uint64_t msg) const {
        auto right = msg bitand mask_;
        auto left = (msg >> shift_) bitand mask_;
        return std::make_tuple(left, right);
    }

    uint64_t combine(uint64_t left, uint64_t right) const {
        return (left << shift_) bitor right;
    }

    void round(uint64_t& left, uint64_t& right, uint64_t constant) const {
        auto prf_value = prf_(right, constant) bitand mask_;
        auto r = left ^ prf_value;
        left = right;
        right = r;
    }

    static constexpr uint64_t Rounds[] = {
        0x88ef7267b3f978daull,
        0x5457c7476ab3e57full,
        0x89529ec3c1eec593ull,
        0x3fac1e6e30cad1b6ull,
        0x56c644080098fc55ull,
        0x70f2b329323dbf62ull,
        0x08ee98c0d05e3dadull,
        0x3eb3d6236f23e7b7ull,
        0x47d2e1bf72264fa0ull,
        0x1fb274465e56ba20ull,
        0x077de40941c93774ull,
        0x857961a8a772650dull
    };

    int shift_;
    uint64_t mask_;
    int nrounds_;
    PRF prf_;
};
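As a quick sanity check (not part of the original answer, and assuming the FeistelNetwork class and pseudo_random_function above are in scope), a round-trip test on a 20-bit domain might look like this:

#include <cassert>
#include <cstdint>

int main() {
    // 20-bit domain, 3 rounds: encode is a bijection on [0, 2^20), so
    // decode(encode(i)) must recover i for every input.
    FeistelNetwork net(20, 3, &pseudo_random_function);
    for (uint64_t i = 0; i < (uint64_t{1} << 20); ++i)
        assert(net.decode(net.encode(i)) == i);
    return 0;
}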
If the source index range happens to be an even power of two, then we can simply call encode on the Feistel network to map a source index to a pseudo-random target index. In general, however, the Feistel network may return an encoding that is outside the source index domain. The solution is to simply call encode on the out-of-range index until we get an index that is in the source index domain. This recursion will terminate because the Feistel network encryption is bijective and the domain is finite. For the worst-case source index range (i.e. one more than an even power of two), there will be an average of almost four calls to encode for a balanced network, or two for an unbalanced network. The following class implements this basic logic along with mapping the source index domain from min,max to 0,max-min.
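The class itself is in the repository linked below; a minimal sketch of the cycle-walking wrapper, based only on the interface used by the driver further down, could look like the following (the std::bit_width-based sizing and the member names are my assumptions, not the author's actual code):

// Hedged sketch only; the actual PseudoRandomPermutation lives in the GitHub
// repository referenced below. Assumes C++20 <bit> and the FeistelNetwork
// template defined above.
#include <bit>
#include <cstdint>
#include <utility>

template<class PRF>
class PseudoRandomPermutation {
public:
    PseudoRandomPermutation(uint64_t min, uint64_t max, int rounds, PRF&& prf)
        : min_(min)
        , max_(max)
        // Enough bits to cover [0, max - min]; the Feistel domain may be larger.
        , network_(std::bit_width((max - min) | 1), rounds, std::forward<PRF>(prf)) {
    }
    uint64_t min() const { return min_; }
    uint64_t max() const { return max_; }
    uint64_t encode(uint64_t index) const {
        // Shift into [0, max - min], encrypt, and cycle-walk until the
        // ciphertext lands back inside the domain, then shift back.
        auto code = network_.encode(index - min_);
        while (code > max_ - min_)
            code = network_.encode(code);
        return code + min_;
    }
private:
    uint64_t min_;
    uint64_t max_;
    FeistelNetwork<PRF> network_;
};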
All of the code and images can be found on GitHub in the 76076957 directory. I used the following driver for testing and for generating the performance metrics, all of which use three rounds in the Feistel network. I wrote the code for clarity and have not yet done any performance work, although I think the inner loops are already pretty efficient.
#include "core/util/tool.h"
#include "core/chrono/stopwatch.h"
#include "core/string/lexical_cast_stl.h"
template<class Work>
void measure(std::ostream& os, std::string_view desc, Work&& work) {
chron::StopWatch timer;
timer.mark();
if (work())
os << fmt::format("{:>12s}: work failed", desc) << endl;
auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
os << fmt::format("{:>12s}: {:5d} ms", desc, millis) << endl;
}
int tool_main(int argc, const char *argv[]) {
ArgParse opts
(
argValue<'m'>("range", std::make_pair(0, 16), "Permutation range min:max"),
argValue<'r'>("rounds", 3, "Number of rounds"),
argFlag<'p'>("performance", "Measure performance"),
argFlag<'s'>("sort", "Sort index based on PRF")
);
opts.parse(argc, argv);
auto [min, max] = opts.get<'m'>();
auto rounds = opts.get<'r'>();
auto measure_performance = opts.get<'p'>();
auto sort_index = opts.get<'s'>();
if (measure_performance) {
PseudoRandomPermutation perm(min, max, rounds, &pseudo_random_function);
measure(cout, "Permutation", [&]() {
for (auto i = perm.min(); i < perm.max(); ++i) {
auto code = perm.encode(i);
if (code < perm.min() or code > perm.max())
return true;
}
return false;
});
} else if (sort_index) {
std::vector<uint64_t> codes;
for (auto i = min; i < max; ++i)
codes.push_back(i);
std::sort(codes.begin(), codes.end(), [](uint64_t a, uint64_t b) {
return iterate_prf(a, 3) < iterate_prf(b, 3);
});
for (auto elem : codes)
cout << elem << endl;
} else {
std::set<uint64_t> codes;
PseudoRandomPermutation perm(min, max, rounds, &pseudo_random_function);
for (auto i = min; i < max; ++i) {
auto code = perm.encode(i);
assert(code >= min and code <= max);
codes.insert(code);
cout << i << " " << code << endl;
}
assert(codes.size() == max - min);
}
return 0;
}
I have not done any statistical tests, but have simply eyeballed the plots, and based on those eyeball tests I believe this answer satisfies the criteria:
39 ns / index on Mac M1 Pro (arm64, macOS)
52 ns / index on Intel Xeon E5-2698 @ 2.2 GHz (x86, Ubuntu 20.04)
The following two plots compare std::shuffle versus feisty for creating the pseudo-random permutation of 20k indices. The third plot shows the cumulative sum of the pseudo-random indices, which should be a straight line for a uniform distribution.
Just for curiosity's sake, here are plots using from 1 to 5 rounds of the Feistel network. As suggested by theory, at least three rounds are needed to achieve good results.
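For completeness, here is a hedged sketch (not from the original answer) of how the permutation could be dropped into the question's distributeIndices signature and parallelized with oneTBB; it assumes pseudo_random_function and a PseudoRandomPermutation like the one sketched above.

#include <cstdint>
#include <vector>
#include <tbb/parallel_for.h>

// Hedged sketch: wraps the permutation in the question's interface.
std::vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    PseudoRandomPermutation perm(from, to, 3, &pseudo_random_function);
    std::vector<uint64_t> indices(to - from + 1);
    // Each target index depends only on its own source index, so the loop
    // parallelizes without any shared state besides the output slot.
    tbb::parallel_for(uint64_t{0}, to - from + 1, [&](uint64_t i) {
        indices[i] = perm.encode(from + i);
    });
    return indices;
}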