I was doing my homework for an algorithms course, and it was about string sorting algorithms. I had to implement different algorithms, count the number of symbol comparisons, draw a graph, and explain the results. The results of one of these algorithms really confuse me.
The algorithm is a hybrid of MSD radix sort and string quick sort: when the size of the subarray being sorted is less than some threshold, it is sorted with a quick sort variant optimised for strings rather than with radix sort.
Here is the resulting graph:
On the x-axis are the sizes of the input arrays; on the y-axis is the average number of comparisons over 50 randomly generated arrays of the respective size. The arrays contain random strings of 10-200 symbols (the length is also random). The chosen threshold value is 50. The local maximum is at an input size of around 3000.
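To make the setup concrete, the test arrays just described could be generated with something like the sketch below; the function name make_test_array and the alphabet are placeholders, since the actual generator and symbol set are not shown here.

#include <cstddef>
#include <random>
#include <string>
#include <vector>

std::vector<std::string> make_test_array(std::size_t n, std::mt19937 &gen) {
    // Placeholder alphabet; the real set of allowed symbols may differ.
    static const std::string alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    std::uniform_int_distribution<std::size_t> len_dist(10, 200);
    std::uniform_int_distribution<std::size_t> sym_dist(0, alphabet.size() - 1);
    std::vector<std::string> arr(n);
    for (auto &s : arr) {
        s.resize(len_dist(gen));          // random length in [10, 200]
        for (auto &ch : s) {
            ch = alphabet[sym_dist(gen)]; // random allowed symbol
        }
    }
    return arr;
}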
Here are the results for threshold = 25:
The pattern is the same, though the local maximum is at 1500.
The pattern seems to be the following: for relatively small arrays the number of comparisons peaks at some specific size, while for large inputs it keeps increasing. The latter is quite natural, but the behavior on smaller arrays is puzzling.
Since radix sort is not a comparison-based sorting algorithm, all comparisons are made by quick sort. It looks as though, for arrays of certain sizes, the quick sort has to deal with worst-case arrays (the ones that make it O(n^2)) more often than usual. Why does this happen?
My code:
Here index is a function that takes a char and returns its index in the array of allowed symbols. It respects the order of the ASCII table, and the index of '\0' is 0. It may be considered to behave the same as casting a char to an int.
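One possible implementation consistent with that description and with the allowed.size() + 1 buckets used below could look like this (just a sketch; allowed is assumed to be a std::string holding the permitted symbols in ascending ASCII order):

size_t index(char sym) {
    // Hypothetical sketch: '\0' (reading past the end of a string) goes to
    // bucket 0, every allowed symbol to 1 + its position in `allowed`,
    // which preserves the ASCII order of the symbols.
    if (sym == '\0') {
        return 0;
    }
    return allowed.find(sym) + 1;  // assumes sym is always an allowed symbol
}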
struct RadixSortStep {
    size_t l;
    size_t r;
    size_t c;
};

void radix_sort(std::vector<std::string> &arr, CompCounter &cmp, bool sw) {
    std::queue<std::string> qs[allowed.size() + 1];
    std::queue<RadixSortStep> trace{};
    trace.push({0, arr.size() - 1, 0});
    while (!trace.empty()) {
        auto [l, r, c] = trace.front();
        trace.pop();
        if (l >= r) {
            continue;
        }
        // Small subarrays are handed over to string quick sort.
        if (sw && r - l <= threshold) {
            string_quick_sort(arr, cmp, l, r, c);
            continue;
        }
        // Distribute the strings into buckets by the symbol at position c.
        for (size_t i = l; i <= r; ++i) {
            char sym = arr[i][c];
            auto ind = index(sym);
            auto &q = qs[ind];
            auto &s = arr[i];
            q.push(std::move(s));
        }
        // Write the buckets back and schedule each non-trivial bucket
        // (except bucket 0, the end-of-string bucket) for the next symbol.
        size_t start = l;
        for (size_t i = 0; i <= allowed.size(); ++i) {
            auto &q = qs[i];
            if (q.empty()) {
                continue;
            }
            size_t end = start + q.size() - 1;
            size_t j = start;
            while (!q.empty()) {
                arr[j++] = std::move(q.front());
                q.pop();
            }
            if (i != 0 && start < end) {
                trace.push(RadixSortStep{start, end, c + 1});
            }
            start = end + 1;
        }
    }
}
void string_quick_sort(std::vector<std::string> &arr, CompCounter &cmp, size_t l, size_t r, size_t c) {
    if (l >= r) {
        return;
    }
    // Three-way partition of [l, r] by the symbol at position c; a string
    // that ends at position c compares as smaller than any that continues.
    auto pivot = arr[l];
    bool short_pivot = pivot.size() == c;
    size_t el = l;
    size_t er = l;
    for (size_t k = l + 1; k <= r; ++k) {
        if (arr[k].size() == c) {
            if (short_pivot) {
                std::swap(arr[er + 1], arr[k]);
                ++er;
            } else {
                std::swap(arr[el], arr[k]);
                std::swap(arr[k], arr[er + 1]);
                ++el;
                ++er;
            }
        } else if (!short_pivot) {
            int cm = cmp.cmp(arr[k][c], pivot[c]);
            if (cm < 0) {
                std::swap(arr[el], arr[k]);
                std::swap(arr[k], arr[er + 1]);
                ++el;
                ++er;
            } else if (cm == 0) {
                std::swap(arr[er + 1], arr[k]);
                ++er;
            }
        }
    }
    if (el > 0) {
        string_quick_sort(arr, cmp, l, el - 1, c);
    }
    string_quick_sort(arr, cmp, er + 1, r, c);
    if (!short_pivot || l != el || r != er) {
        string_quick_sort(arr, cmp, el, er, c + 1);
    }
}
CompCounter is a utility class that keeps track of symbol comparisons in its field count. The only method of it that matters here is cmp:

int CompCounter::cmp(char c1, char c2) {
    ++count;
    return c1 - c2;
}
Update: as @UlrichEckhardt suggested, I changed the quick sort implementation so that it now uses the middle or a random element of the subarray as the pivot:
static std::mt19937 gen(std::random_device{}());

void string_quick_sort(std::vector<std::string> &arr, CompCounter &cmp, size_t l, size_t r, size_t c) {
    if (l >= r) {
        return;
    }
    // auto pivot = arr[(r + l) / 2]; // <- pivot in the middle
    // random pivot
    auto index = std::uniform_int_distribution<size_t>(l, r)(gen);
    auto pivot = arr[index];
    bool short_pivot = pivot.size() == c;
    size_t lt = l;
    size_t eq = l;
    size_t gt = r;
    while (eq <= gt) {
        if (arr[eq].size() == c) {
            if (short_pivot) {
                ++eq;
            } else {
                std::swap(arr[eq++], arr[lt++]);
            }
        } else if (short_pivot) {
            std::swap(arr[eq], arr[gt--]);
        } else { // if (!short_pivot)
            int cm = cmp.cmp(arr[eq][c], pivot[c]);
            if (cm < 0) {
                std::swap(arr[eq++], arr[lt++]);
            } else if (cm == 0) {
                ++eq;
            } else { // if (cm > 0)
                std::swap(arr[eq], arr[gt--]);
            }
        }
    }
    if (lt > 0) {
        string_quick_sort(arr, cmp, l, lt - 1, c);
    }
    string_quick_sort(arr, cmp, gt + 1, r, c);
    if (!short_pivot || l != lt || r != gt) {
        string_quick_sort(arr, cmp, lt, gt, c + 1);
    }
}
I tried both implementations, but it made little difference: the pattern itself and the local maxima are exactly the same, though the absolute number of comparisons seems to increase. Here is a graph for the randomized pivot with threshold = 50 as an example:
The graph for pivot in the middle is nearly identical.
It seems that this pattern has nothing to do with worst-case arrays at all! Now I am even more confused...
When the RadixSort step is performed on the first string symbol, an array of size M is split into N groups, where N is the size of the string alphabet (the number of distinct characters used for string generation). Each group's size should be close to M/N if the source data is really random.
Assume that the threshold value is T.
While M/N is less than T (M below T*N), after the first (and only!) RadixSort step each of the N groups of size M/N is sorted via QuickSort, so the number of comparisons should be N * O((M/N) * log(M/N)) = O(M * log(M/N)) = O(M * log(M) - M * log(N)).
But when M/N reaches T (M reaches T * N), another RadixSort step is performed (initially for some groups, but then for all), and the initial array is split into N^2 groups of size M/N^2. Now the number of comparisons should be N^2 * O((M/N^2) * log(M/N^2)) = O(M * log(M/N^2)) = O(M * log(M) - M * log(N^2)) = O(M * log(M) - 2 * M * log(N)), so it is decreased by O(M * log(N)).
If M grows further and reaches T * N^2, there should be another local maximum on the graph, and so on for T * N^3, T * N^4, ...
In your case (maximum at M=3000 for T=50) the alphabet size N should be 60. So another maximum is expected at an array size of 50 * 60 * 60 = 180000, but that is far beyond the graph's upper limit of 15000.
For the second case (maximum at M=1500 for T=25) the alphabet size is the same, and another maximum is expected at 25 * 60 * 60 = 90000, which is also too large to appear on the graph.
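As a rough sanity check of this model (just a back-of-the-envelope sketch, not part of the sorting code; N = 60 and T = 50 are the values deduced above), one can estimate the comparison count as M * log2(group size), where the group size is M repeatedly divided by N until it no longer exceeds the threshold:

#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    const double N = 60.0;  // assumed alphabet size, deduced from the graphs
    const double T = 50.0;  // threshold at which quick sort takes over
    for (double M = 100.0; M <= 15000.0; M += 100.0) {
        // Radix steps keep splitting every group by N until the groups fit
        // under the threshold; quick sort then costs about size * log2(size)
        // per group, i.e. about M * log2(group) in total.
        double group = M;
        while (group > T) {
            group /= N;
        }
        double estimate = M * std::log2(std::max(group, 2.0));
        std::printf("M = %6.0f  estimated comparisons ~ %9.0f\n", M, estimate);
    }
    return 0;
}

The estimate grows up to M = T * N = 3000 and then drops sharply, which matches the shape of the measured curve.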