I need to distribute a set of repetitive strings as evenly as possible.
Is there any way to do this better then simple shuffling using unsort? It can't do what I need.
For example if the input is
aaa
aaa
aaa
bbb
bbb
The output I need
aaa
bbb
aaa
bbb
aaa
The number of repetitive strings doesn't have any limit as well as the number of the reps of any string.
The input can be changed to list string number_of_reps
aaa 3
bbb 2
... .
zzz 5
Is there an existing tool, Perl module or algorithm to do this?
Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.
By "evenly distribute" I mean that intervals between each two occurrences of a string and the interval between the start point and the first occurrence of the string and the interval between the last occurrence and the end point must be as much close to equal as possible where 'interval' is the number of other strings.
It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination
A B A C B A A
would give the count
A: 1 2 3 1 1
B: 2 3 3
C: 4 4
I.e. Two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the string. These properties make the distances easier to calculate, but are just a constant that will be removed later.
This is the code for counting distances:
sub distances {
my %distances;
my %last_seen;
for my $i (0 .. $#_) {
my $s = $_[$i];
push @{ $distances{$s} }, $i - ($last_seen{$s} // -1);
$last_seen{$s} = $i;
}
push @{ $distances{$_} }, @_ - $last_seen{$_} for keys %last_seen;
return values %distances;
}
Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far they are off from the average a. As it is squared, large anomalies are heavily penalized:
variance(d, a) = (a - d)²
We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:
svar(items) = sqrt ∑_i variance(items[i], average(items))
Expressed as Perl code:
use List::Util qw/sum min/;
sub svar (@) {
my $med = sum(@_) / @_;
sqrt sum map { ($med - $_) ** 2 } @_;
}
We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:
The following can be swapped out by a different procedure, but I decided to weigh each standard variance by raising it to the power of occurrences, then adding all weighted svariances:
sub weigh_distance {
return sum map {
my @distances = @$_; # the distances of one string
svar(@distances) ** $#distances;
} distances(@_);
}
This turns out to prefer good distributions.
We can now calculate the weight of a given permutation by passing it to weigh_distance
. Therefore, we can decide if two permutations are equally well distributed, or if one is to be prefered:
Given a selection of permuations, we can select those permutations that are optimal:
sub select_best {
my %sorted;
for my $strs (@_) {
my $weight = weigh_distance(@$strs);
push @{ $sorted{$weight} }, $strs;
}
my $min_weight = min keys %sorted;
@{ $sorted{$min_weight} }
}
This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returend array can be selected.
Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping the strings to the remaining avaliable occurrences. With a bit of recursion, we can build all permutations like
use Carp;
# called like make_perms(A => 4, B => 1, C => 1)
sub make_perms {
my %words = @_;
my @keys =
sort # sorting is important for cache access
grep { $words{$_} > 0 }
grep { length or carp "Can't use empty strings as identifiers" }
keys %words;
my ($perms, $ok) = _fetch_perm_cache(\@keys, \%words);
return @$perms if $ok;
# build perms manually, if it has to be.
# pushing into @$perms directly updates the cached values
for my $key (@keys) {
my @childs = make_perms(%words, $key => $words{$key} - 1);
push @$perms, (@childs ? map [$key, @$_], @childs : [$key]);
}
return @$perms;
}
The _fetch_perm_cache
returns an ref to a cached array of permutations, and a boolean flag to test for success. I used the following implementation with deeply nested hashes, that stores the permutations on leaf nodes. To mark the leaf nodes, I have used the empty string—hence the above test.
sub _fetch_perm_cache {
my ($keys, $idxhash) = @_;
state %perm_cache;
my $pointer = \%perm_cache;
my $ok = 1;
$pointer = $pointer->{$_}[$idxhash->{$_}] //= do { $ok = 0; +{} } for @$keys;
$pointer = $pointer->{''} //= do { $ok = 0; +[] }; # access empty string key
return $pointer, $ok;
}
That not all strings are valid input keys is no issue: every collection can be enumerated, so make_perms
could be given integers as keys, which are translated back to whatever data they represent by the caller. Note that the caching makes this non-threadsafe (if %perm_cache
were shared).
This is now a simple matter of
say "@$_" for select_best(make_perms(A => 4, B => 1, C => 1))
This would yield
A A C A B A
A A B A C A
A C A B A A
A B A C A A
which are all optimal solutions by the used definition. Interestingly, the solution
A B A A C A
is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Futher work.
Preferable versions are first: AABAA ABAAA, ABABACA ABACBAA(two 'A' in a row), ABAC ABCA
We can run these test cases by
use Test::More tests => 3;
my @test_cases = (
[0 => [qw/A A B A A/], [qw/A B A A A/]],
[1 => [qw/A B A C B A A/], [qw/A B A B A C A/]],
[0 => [qw/A B A C/], [qw/A B C A/]],
);
for my $test (@test_cases) {
my ($correct_index, @cases) = @$test;
my $best = select_best(@cases);
ok $best ~~ $cases[$correct_index], "[@{$cases[$correct_index]}]";
}
Out of interest, we can calculate the optimal distributions for these letters:
my @counts = (
{ A => 4, B => 1 },
{ A => 4, B => 2, C => 1},
{ A => 2, B => 1, C => 1},
);
for my $count (@counts) {
say "Selecting best for...";
say " $_: $count->{$_}" for keys %$count;
say "@$_" for select_best(make_perms(%$count));
}
This brings us
Selecting best for...
A: 4
B: 1
A A B A A
Selecting best for...
A: 4
C: 1
B: 2
A B A C A B A
Selecting best for...
A: 2
C: 1
B: 1
A C A B
A B A C
C A B A
B A C A
The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.
weight(svar, occurrences) → weighted_variance
weight(0.9, 10) → 0.35
weight(0.5, 1) → 0.5
This should in fact be reversed.
Below is a faster procedure that approximates a good distribution. In some cases, it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but is generally acceptable where only few strings have few occurrences. It is significantly faster than the brute-force solution.
It works by inserting strings at regular intervals, then spreading out avoidable repetitions.
sub approximate {
my %def = @_;
my ($init, @keys) = sort { $def{$b} <=> $def{$a} or $a cmp $b } keys %def;
my @out = ($init) x $def{$init};
while(my $key = shift @keys) {
my $visited = 0;
for my $parts_left (reverse 2 .. $def{$key} + 1) {
my $interrupt = $visited + int((@out - $visited) / $parts_left);
splice @out, $interrupt, 0, $key;
$visited = $interrupt + 1;
}
}
# check if strings should be swapped
for my $i ( 0 .. $#out - 2) {
@out[$i, $i + 1] = @out[$i + 1, $i]
if $out[$i] ne $out[$i + 1]
and $out[$i + 1] eq $out[$i + 2]
and (!$i or $out[$i + 1 ] ne $out[$i - 1]);
}
return @out;
}
I generalized the algorithm for any objects, not just strings. I did this by translating the input to an abstract representation like “two of the first thing, one of the second”. The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because A => 4, C => 2
, C => 4, B => 2
and $regex => 2, $fh => 4
represent the same abstract multisets. The speed penalty incurred by the neccessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.
The large bottleneck is in the select_best
sub, which I have largely rewritten in Inline::C (still eats ~80% of execution time).
These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.