I have a variable x and I want to divide it into three groups with roughly equal numbers of observations. However, cutting at the quantiles did not give the most equal groups because of ties, since quantile cut-off points can allocate tied values to more than one group. I am looking for a function or algorithm that finds the best cut-off points while ensuring that ties are not split across multiple groups.
x = c(26, 34, 27, 26, 38, 40, 34, 28, 27, 36, 29, 30, 29, 44, 30,
34, 32, 30, 26, 29, 34, 32, 38, 27, 35, 29, 28, 34, 26, 27, 27,
30, 27, 28, 27, 28, 28, 27, 29, 29, 28, 29, 29, 28, 29, 29, 28,
27, 29, 27, 36, 34, 34, 39, 34, 31, 31, 33, 35, 31, 31, 32, 37,
38, 32, 31, 28, 33, 33, 28, 27, 27, 30, 31, 32, 28, 27, 31, 36,
27, 33, 31, 34, 31, 35, 38, 37, 36, 39, 33, 33, 28, 41, 34, 35,
37, 37, 41, 32, 37, 30, 34, 38, 30, 40, 35, 31, 30, 30, 29, 29,
30, 29, 35, 28, 27, 27, 27, 29, 27, 28, 27, 27, 27, 26, 28, 28,
27, 29, 29, 27, 27, 27, 27, 29, 27, 28, 27, 28, 34, 29, 28, 28,
28, 29, 38, 33, 39, 28, 27, 28, 27, 29, 34, 29, 32, 70, 26, 29,
43, 48, 30, 30, 27, 26, 29, 27, 27, 27, 27, 28, 28, 27, 28, 28,
27, 28, 28, 38, 52, 26, 31, 56, 29, 29, 36, 28, 35, 32, 34, 35,
28, 27, 37, 26, 26, 32, 26, 27, 30, 28, 28, 30, 29, 30, 29, 29,
28, 26, 33, 39, 26, 31, 27, 28, 30, 30, 28, 28, 29, 26, 27, 26,
29, 28, 28, 27, 27, 27, 28, 27, 28, 28, 28, 28, 28, 27, 27, 29,
27, 26, 28, 28, 27, 27, 28, 27, 28, 28, 30, 27, 30, 28, 32, 34,
28, 27, 28, 28, 27, 28, 27, 27, 27, 28, 27, 28, 27, 27, 28, 27,
27, 27, 27, 27, 28, 27, 27, 27, 26, 27, 27, 30, 28, 27, 30, 30,
42, 26, 27, 40, 33, 29, 29, 29, 52, 58, 44, 32, 43, 30, 27, 38,
30, 27, 30, 27, 31, 39, 35, 32, 32, 34, 45, 31, 44, 42, 29, 29,
30, 30, 50, 30, 33, 31, 35, 27, 28, 27, 28, 55, 28, 28, 28, 27,
27, 28, 29, 27, 28, 27, 28, 28, 28, 28, 27, 28, 29, 34, 45, 27,
29, 61, 38, 62, 29, 36, 36, 30, 31, 45, 27, 30, 28, 29, 44, 45,
42, 52, 50, 52, 42, 38, 42, 32, 27, 37, 40, 52, 27, 36, 38, 39,
34, 30, 29, 34, 29, 26, 35, 43, 33, 40, 35, 33, 41, 61, 45, 35,
52, 50, 38, 43, 29, 35, 38, 39, 31, 28, 28, 29, 34, 27, 30, 32,
28, 26, 28, 27, 26, 29, 27, 26, 29, 29, 27, 29, 27, 27, 29, 27,
30, 29, 25, 30, 27, 29, 29, 30, 30, 27, 30, 28, 28, 27, 29, 29,
30, 29, 27, 28, 28, 28, 29, 28, 28, 27, 28, 29, 28, 29, 27, 28,
28, 28, 30, 27, 27, 28, 26, 28, 27, 27, 28, 28, 28, 28, 27, 27,
28, 27, 28, 27, 35, 27, 27, 28, 29, 27, 27, 28, 26, 27, 28, 28,
28, 27, 27, 27, 28, 32, 27, 28, 28, 29, 28, 28, 27, 28, 28, 30,
29, 28, 25, 27, 28, 30, 28, 30, 30, 28, 30, 30, 28, 29, 30, 28,
28, 26, 27, 28, 45, 36, 40, 28, 50, 45, 30, 45, 40, 30, 45, 45,
29, 45, 35, 40, 40, 30, 30, 30, 45, 40, 40, 40, 40, 40, 40, 35,
34, 49, 40, 30, 61, 35, 40, 30, 36, 35, 29, 27, 48, 28, 27, 27,
26, 27, 29, 27, 26, 27, 31, 27, 27, 28, 29, 28, 27, 28, 29, 38,
30, 26, 36, 40, 58, 57, 30, 33, 56, 35, 39, 37, 38, 46, 37, 39,
39, 45, 35, 46, 58, 65, 60, 45, 32, 36, 43, 32, 68, 39, 28, 31,
27, 28, 27, 37, 38, 30, 30, 28, 36, 45, 28, 26, 28, 28, 28, 27,
26, 28, 27, 26, 26, 27, 28, 31, 32, 37, 35, 29, 33, 35, 29, 41,
32, 36, 29, 28, 28, 28, 37, 36, 37, 35, 31, 32, 30, 27, 31, 32,
31, 33, 28, 33, 29, 27, 28, 31, 28, 31, 28, 34, 27, 27, 28, 27,
27, 27, 27, 26, 26, 26, 27, 27, 28, 26, 31, 26, 29, 31, 29, 29,
30, 29, 30, 31, 32, 29, 30, 27, 32, 27, 26, 31, 31, 31, 27, 27,
33, 27, 28, 28, 28, 26, 27, 27, 28, 30, 27, 27, 30, 29, 26, 27,
28, 27, 26, 26, 28, 27, 26, 28, 28, 26, 28, 27, 29, 27, 28, 28,
26, 26, 29, 28, 27, 27, 27, 28, 26, 25, 27, 29, 30, 36, 40, 28,
38, 26, 27, 27, 50, 27, 45, 27, 28, 26, 25, 35, 35, 44, 30, 27,
31, 27, 28, 27, 27, 28, 28, 28, 35, 33, 30, 28, 28, 29, 29, 36,
32, 36, 34, 32, 28, 28, 29, 28, 28, 32, 30, 35, 33, 36, 32, 30,
32, 36, 34)
quantile(x, probs = c(0.333, 0.666))
#> 33.3% 66.6%
#> 28 31
l = cut(x, breaks = c(-Inf, 28, 31, Inf))
table(l)
#> l
#> (-Inf,28] (28,31] (31, Inf]
#> 387 185 246
#using different cut-off points yielded more equal groups
l = cut(x, breaks = c(-Inf, 28, 32, Inf))
table(l)
#> l
#> (-Inf,28] (28,32] (32, Inf]
#> 387 214 217
#again using different cut-off points which yielded more equal groups
l = cut(x, breaks = c(-Inf, 27, 32, Inf))
table(l)
#> l
#> (-Inf,27] (27,32] (32, Inf]
#> 222 379 217
Created on 2024-10-07 with reprex v2.1.1
Edit: The word "equal" may be unclear. What I am looking for is the allocation that gives the smallest difference between the largest and the smallest group sizes, where each group contains only consecutive values and no tied value ends up in more than one group.
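The criterion described in the edit can be written as a small helper, which makes it easier to compare candidate cut-offs. This is only a sketch: `size_spread` is a hypothetical name (not an existing function), and the toy vector stands in for the real x.

```r
# Hypothetical helper: difference between the largest and the smallest
# group when x is cut at the given (inclusive) upper break values.
size_spread <- function(x, breaks) {
  groups <- cut(x, breaks = c(-Inf, breaks, Inf))
  sizes <- table(groups)   # empty groups are counted as 0
  max(sizes) - min(sizes)
}

# Toy data with heavy ties (an assumption for illustration, not the real x)
toy <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5)

size_spread(toy, c(1, 3))  # groups of 4, 5 and 3 observations
size_spread(toy, c(2, 3))  # groups of 6, 3 and 3 observations
```

Because the breaks are always placed at observed values and `cut()` uses right-closed intervals, tied values can never be split between two groups.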
This script enumerates all possible groupings and determines the cut points that are "the most equal", understood as the cut points for which the difference between the largest and the smallest group is minimal.
Such a combinatorial approach is only feasible for a moderate number of groups and a moderate data size. Otherwise, the task of determining the groups given the sums is NP-hard (I do not know whether that also holds for ordered cuts).
# Count and order the classes
tbl <- unclass(table(x))
# Enumerate the possible cutting points. There are 595 possibilities
cutting <- combn(length(tbl) - 1, 2)
# Sum the number of elements in each possible group
sums <- apply(cutting, 2, \(i) c(
  sum(tbl[1:i[1]]),            # from the minimum to the first cut point (inclusive)
  sum(tbl[(i[1] + 1):i[2]]),   # from just after the first cut point to the second (inclusive)
  sum(tbl[-(1:i[2])])))        # from just after the second cut point to the maximum
# Check that the group sizes of every candidate add up to sum(tbl) (818)
stopifnot(all(colSums(sums) == sum(tbl)))
# Calculate the difference between the largest and the smallest group
# as a metric of "most equal" groups.
count_diff <- apply(sums, 2, \(i) max(i) - min(i))
#FINALLY: best cut points (inclusive)
print(names(tbl)[cutting[,order(count_diff)[1]]])
# 27 & 30
print(sums[,order(count_diff)[1]])
# 222, 318, 278
#second best (and so on...)
names(tbl)[cutting[,order(count_diff)[2]]]
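The same enumeration can be wrapped in a function that accepts any number of groups. The following is a minimal sketch under stated assumptions, not a definitive implementation: `best_cuts` and its arguments are hypothetical names, group sizes are taken from cumulative counts instead of repeated `sum()` calls, and the toy vector is illustrative data rather than the real x.

```r
# Hypothetical generalisation of the enumeration above to k groups.
# Returns the inclusive upper cut values and the resulting group sizes.
best_cuts <- function(x, k = 3) {
  vals <- sort(unique(x))             # distinct values in increasing order
  csum <- cumsum(as.vector(table(x))) # running totals over those values
  n <- length(csum)
  stopifnot(k >= 2, n >= k)
  # Each column holds k - 1 cut positions (indices into vals)
  cand <- combn(n - 1, k - 1)
  # Spread (largest minus smallest group) for every candidate
  spread <- apply(cand, 2, function(i) {
    sizes <- diff(c(0, csum[i], csum[n]))
    max(sizes) - min(sizes)
  })
  best <- cand[, which.min(spread)]
  list(cuts = vals[best],
       sizes = diff(c(0, csum[best], csum[n])))
}

# Toy data with heavy ties (an assumption for illustration)
toy <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5)
res <- best_cuts(toy, k = 3)
res$cuts    # upper (inclusive) bounds of the first k - 1 groups
res$sizes   # number of observations per group
table(cut(toy, breaks = c(-Inf, res$cuts, Inf)))
```

The number of candidates is `choose(n - 1, k - 1)`, where n is the number of distinct values, so this stays practical only while that count is small. When several candidates tie on the spread, `which.min()` returns the first one.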