I'm developing custom subset-sum algorithms and have run into a puzzling issue: it seems difficult to generate truly "hard" subset-sum instances (i.e., instances that force exponential computational effort) without using very large integers (e.g., greater than about 2^22).
Specifically, I'd like to know whether it is even possible to construct such hard instances while keeping the individual elements small, and if so, how.
For context, here are some attempts I've made at generating potentially hard instances (feedback or improvements welcome):
import random
import math

def generate_exponential_instance(n):
    # n random elements of up to 2**22, with the target taken from a random
    # non-empty subset so that a solution is guaranteed to exist.
    max_element = 2 ** 22
    A = [random.randint(1, max_element) for _ in range(n)]
    while True:
        mask = [random.choice([0, 1]) for _ in range(n)]
        if sum(mask) != 0:
            break
    target = sum(A[i] * mask[i] for i in range(n))
    return A, target
def generate_dense_high_values_instance(n):
    # All elements sit in a narrow band just below 2**22; the target is the
    # sum of a random subset of size n // 2.
    base = 2 ** 22 - random.randint(0, 100)
    A = [base + random.randint(0, 20) for _ in range(n)]
    target = sum(random.sample(A, k=n // 2))
    return A, target
def generate_merkle_hellman_instance(n, max_step=20):
    # Build a superincreasing private key, then hide it behind modular
    # multiplication as in Merkle-Hellman; the public key no longer has the
    # superincreasing structure that makes the private instance easy.
    total = 0
    private_key = []
    for _ in range(n):
        next_val = total + random.randint(1, max_step)
        private_key.append(next_val)
        total += next_val
    q = random.randint(total + 1, 2 * total)
    r = random.randint(2, q - 1)
    while math.gcd(r, q) != 1:  # r must be invertible mod q for the trapdoor
        r = random.randint(2, q - 1)
    public_key = [(r * w) % q for w in private_key]
    message = [random.randint(0, 1) for _ in range(n)]
    ciphertext = sum(b * k for b, k in zip(message, public_key))
    return public_key, ciphertext
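For completeness, here is roughly how I exercise these generators. The brute-force checker is just a throwaway sanity test (exponential in n, so only viable for tiny instances), not part of the constructions themselves:

from itertools import combinations

def has_subset_sum_bruteforce(A, target):
    # Exhaustive check over all 2**n subsets; only usable for small n.
    return any(sum(c) == target
               for r in range(len(A) + 1)
               for c in combinations(A, r))

for gen in (generate_exponential_instance,
            generate_dense_high_values_instance,
            generate_merkle_hellman_instance):
    A, target = gen(16)
    print(gen.__name__, has_subset_sum_bruteforce(A, target))

All three constructions embed a planted solution, so each line should print True.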
We know subset-sum is solvable in pseudopolynomial time. "Pseudopolynomial time" means the worst-case running time is bounded by a polynomial in the input length and the largest numeric value appearing in the input. Because a string of L bits can encode numbers as large as O(2^L), a pseudopolynomial-time algorithm can still take time exponential in the input length. Still, pseudopolynomial-time algorithms are considered better than "usual" exponential-time algorithms, since you avoid the exponential behavior entirely if you keep the numbers small.
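To make the length-versus-value distinction concrete, here is a trivial check (the names are just for illustration):

L = 22
# A value that fits in L bits occupies only L bits of input, yet its
# magnitude can be as large as 2**L - 1, so the largest element M can be
# exponential in the input length.
print((2 ** L - 1).bit_length())  # 22
print(2 ** L - 1)                 # 4194303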
For example, even this simple algorithm for subset-sum:
def exists_subset_sum(num_set, target):
    # Grow the set of sums reachable by some subset, one element at a time.
    reachable = {0}
    for num in num_set:
        reachable |= {prev + num for prev in reachable}
    return target in reachable
has pseudopolynomial running time. The key observation is that, after the i-th iteration, the reachable set contains only values in the interval [-i*M, i*M], where M is max(abs(n) for n in num_set). So the size of the set is always O(L*M) (where L is the size of the input in bits), operations on it take time polynomial in L and M, and hence the whole algorithm runs in time polynomial in L and M. Technically, since M can be exponential in L, the time complexity of exists_subset_sum in terms of L alone is exponential. But you only get that exponential behavior if you let M grow exponentially: if you increase L without increasing M, the running time grows at worst polynomially.
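As a rough empirical check of that bound, here is a throwaway helper (not part of the algorithm) that replays the same loop and records how the set grows; with nonnegative inputs, the size of reachable never exceeds i*M + 1 after i elements:

import random

def reachable_sizes(num_set):
    # Replay the loop of exists_subset_sum, recording (iteration, set size,
    # bound i*M + 1) at each step.
    M = max(abs(x) for x in num_set)
    reachable, sizes = {0}, []
    for i, num in enumerate(num_set, start=1):
        reachable |= {prev + num for prev in reachable}
        sizes.append((i, len(reachable), i * M + 1))
    return sizes

small_values = [random.randint(1, 50) for _ in range(40)]
assert all(size <= bound for _, size, bound in reachable_sizes(small_values))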
You'll notice that many algorithms for solving subset-sum are also pseudopolynomial.
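The textbook bottom-up table is the clearest case: assuming nonnegative inputs, it fills a boolean array indexed by the sums 0..target, so its O(n * target) running time is polynomial in the numeric value of target but exponential in target's bit length. A sketch:

def subset_sum_dp(num_set, target):
    # Classic pseudopolynomial DP: dp[s] is True iff some subset sums to s.
    # Time and space are O(len(num_set) * target) -- polynomial in the value
    # of target, exponential in its bit length.
    if target < 0:
        return False
    dp = [False] * (target + 1)
    dp[0] = True
    for num in num_set:
        # Iterate downward so each element is used at most once.
        for s in range(target, num - 1, -1):
            if dp[s - num]:
                dp[s] = True
    return dp[target]

Exactly as with exists_subset_sum, this only blows up when the numeric values (and hence the target) are allowed to grow exponentially with the input length.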