algorithm set mathematical-optimization set-theory

Disjoint Sets of Strings - Minimization Problem

There are two sets, s1 and s2, each containing pairs of letters. A pair is only equivalent to another pair if their letters are in the same order, so they're essentially strings (of length 2). The sets s1 and s2 are disjoint, neither set is empty, and each pair of letters only appears once.

Here is an example of what the two sets might look like:

s1 = { ax, bx, cy, dy }
s2 = { ay, by, cx, dx }

The set of all letters in (s1 ∪ s2) is called sl. The set sr is a set of letters of your choice, but must be a subset of sl. Your goal is to define a mapping m from letters in sl to letters in sr, which, when applied to s1 and s2, will generate the sets s1' and s2', which also contain pairs of letters and must also be disjoint.

The most obvious m just maps each letter to itself. In this example (shown below), s1 is equivalent to s1', and s2 is equivalent to s2' (but given any other m, that would not be the case).

a -> a
b -> b
c -> c
d -> d
x -> x
y -> y

The goal is to construct m such that sr (the set of letters on the right-hand side of the mapping) has the fewest number of letters possible. To accomplish this, you can map multiple letters in sl to the same letter in sr. Note that depending on s1 and s2, and depending on m, you could potentially break the rule that s1' and s2' must be disjoint. For example, you would obviously break that rule by mapping every letter in sl to a single letter in sr.

So, given s1 and s2, how can someone construct an m that minimizes sr, while ensuring that s1' and s2' are disjoint?

Here is a simplified visualization of the problem:

Solution

This problem is NP-hard, to show this, consider reducing graph coloring to this problem.

Proof: Let G=(V,E) be the graph for which we want to compute the minimal graph coloring problem. Formally, we want to compute the chromatic number of the graph, which is the lowest k for which G is k colourable.

To reduce the graph coloring problem to the problem described here, define

 s1 = { zu : (u,v) \in E }
 s2 = { zv : (u,v) \in E }

where z is a magic value unused other than in constructing s1 & s2.

By construction of the sets above, for any mapping m and any edge (u,v) we must have m(u) != m(v), otherwise the disjointedness of s1' and s2' would be violated. Thus, any optimal sr is the set of optimal colors (with the exception of z) to color the graph G and m is the mapping that defines which node is assigned which color. QED.

The proof above may give the intuition that researching graph coloring approximations would be a good start, and indeed it probably would, but there is a confounding factor involved. This confounding factor is that for two elements ab \in s1 and cd \in s2, if m(a) = m(c) then m(b) != m(d). Logically, this is equivalent to the statement m(a) != m(c) or m(b) != m(d). These types of constraints, in isolation, do not map naturally to an analogous graph problem (because of the or statement.)

There are ways to formulate this problem as an (binary) ILP and solve it as such. This would likely give you (slightly) inferior results to a custom designed & tuned branch-and-bound implementation (assuming you want to find the optimal solution) but would work with turn-key solvers.

If you are more interested in approximations (possibly with guaranteed ratios of optimality) I would investigate a SDP relaxation to your problem & appropriate rounding scheme. This level of work would likely be the kind one would invest in a small-to-medium sized research paper.