Given a set of N points, I am required to split it into k subsets S1, ..., Sk. Each subset Si will have a representative Ri. I want to find these R1, ..., Rk to minimize an arbitrary cost function of the subset-members with respect to the cluster representative. Basically,
\min \sum_{i=1}^{k} \sum_{P_j \in S_i} \mathrm{cost}(P_j, R_i)
where the cluster representative Ri is itself obtained from the cluster members by another arbitrary reduce function
R_i = \mathrm{reduce}(S_i)
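For concreteness, one possible choice (just an illustration, assuming points are plain tuples of floats) is squared Euclidean distance as the cost and the coordinate-wise mean as the reduce, in which case the objective above is exactly the k-means objective:

def cost_fn(Pj, Ri):
    # Squared Euclidean distance between a point and its representative
    return sum((p - r) ** 2 for p, r in zip(Pj, Ri))

def reduce_fn(Si):
    # Coordinate-wise mean of the cluster members
    return tuple(sum(coords) / len(Si) for coords in zip(*Si))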
I took inspiration from k-means clustering and came up with the algorithm below, which I'm calling k-reduce clustering. I want to know if there is any algorithm or family of algorithms that does what I'm trying to do.
# Initialize using k random samples from S; could use better initialization
cluster_repr = random_samples(S, k)  # list of k points
clusters = None
while True:
    # Step 1: assignment step
    old_clusters = clusters
    clusters = [[] for _ in range(k)]
    for Pj in S:
        # Assign Pj to the cluster whose representative minimizes its cost
        cluster_idx = min(
            range(k),
            key=lambda i: cost_fn(Pj, cluster_repr[i]),
        )
        clusters[cluster_idx].append(Pj)
    # Step 2: update step, recompute each representative from its members
    cluster_repr = [reduce_fn(clusters[i]) for i in range(k)]
    # Stop once the assignment no longer changes
    if old_clusters == clusters:
        break
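For testing, here is a minimal self-contained sketch that wraps the loop above in a function (the name k_reduce, the helper random_samples, and the toy data are mine) and runs it with the k-means-style cost/reduce pair from above, repeated here so the snippet runs on its own:

import random

def cost_fn(Pj, Ri):
    # Squared Euclidean distance between a point and its representative
    return sum((p - r) ** 2 for p, r in zip(Pj, Ri))

def reduce_fn(Si):
    # Coordinate-wise mean of the cluster members
    return tuple(sum(coords) / len(Si) for coords in zip(*Si))

def random_samples(S, k):
    # k distinct data points as the initial representatives
    return random.sample(S, k)

def k_reduce(S, k, cost_fn, reduce_fn):
    cluster_repr = random_samples(S, k)
    clusters = None
    while True:
        old_clusters = clusters
        clusters = [[] for _ in range(k)]
        for Pj in S:
            idx = min(range(k), key=lambda i: cost_fn(Pj, cluster_repr[i]))
            clusters[idx].append(Pj)
        cluster_repr = [reduce_fn(clusters[i]) for i in range(k)]
        if old_clusters == clusters:
            return clusters, cluster_repr

# Toy data: three noisy 2-D blobs around (0, 0), (5, 5) and (0, 5)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
S = [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5))
     for cx, cy in centers for _ in range(50)]
clusters, reprs = k_reduce(S, 3, cost_fn, reduce_fn)
print([len(c) for c in clusters], reprs)

Like the pseudocode above, this does not handle the corner case where a cluster becomes empty.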
Without some assumptions about those functions, this problem is NP-hard.
Demonstration. Let's reduce 3DM to this. Take a 3-partite hypergraph with 3k nodes and ask for k clusters. Create a cost function cost(Pj, Ri) = ||Ri||. Create a reduce function that maps any triple which is a hyperedge of the hypergraph to a "representative" of norm 1, and maps anything else to a representative of very large norm. Now if there is a perfect 3-dimensional matching, the optimal solution will be that matching, and we've reduced a known NP-complete problem (in fact one of Karp's original 21) to this one.
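To make the construction concrete, here is a toy sketch of that reduction (the encoding, the names, and the 6-node instance are mine, purely for illustration); representatives are plain numbers whose absolute value plays the role of the norm ||Ri||:

BIG = 1e9  # "very large" norm for any cluster that is not a hyperedge

# Toy 3DM instance: 3k = 6 nodes (k = 2) of a 3-partite hypergraph and its hyperedges.
nodes = ["x1", "x2", "y1", "y2", "z1", "z2"]
triples = {frozenset({"x1", "y1", "z1"}),
           frozenset({"x2", "y2", "z2"}),
           frozenset({"x1", "y2", "z2"})}

def reduce_fn(Si):
    # Representative of norm 1 if the cluster is exactly a hyperedge, huge norm otherwise.
    return 1.0 if frozenset(Si) in triples else BIG

def cost_fn(Pj, Ri):
    # cost(Pj, Ri) = ||Ri||: every member pays the norm of its cluster's representative.
    return abs(Ri)

# A partition of the 6 nodes into k = 2 clusters has total cost 3k = 6 exactly when
# every cluster is a hyperedge, i.e. exactly when the instance has a perfect matching
# ({x1, y1, z1} and {x2, y2, z2} here); any other partition costs at least BIG.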
In general we don't know a way to verify that a solution is optimal in polynomial time. So we don't know that your problem is in NP. Therefore the best that we know how to prove is that it is NP-hard rather than NP-complete.