algorithmcluster-analysisk-means

Algorithm to find k optimal representatives for subsets of a set with arbitrary cost function


Given a set of N points, I am required to split it into k subsets S1, ..., Sk. Each subset Si will have a representative Ri. I want to find these R1, ..., Rk to minimize an arbitrary cost function of the subset-members with respect to the cluster representative. Basically,

\min \sum_{i=1}^k \sum_{Pj \in Si} cost(Pj, Ri)

where the cluster representative Ri is itself obtained from the cluster members by another arbitrary reduce function

Ri = reduce(Si)

I took inspiration from k-means clustering, and came up with the below algorithm, that I'm calling k-reduce clustering. I want to know if there is any algorithm or family of algorithms that does what I'm trying to do.

# Initialize using k random samples from S, could use better initialization
cluster_repr = random_samples(S, k) # list of k points
clusters = None

while True:

    # Step 1: Assignment
    old_clusters = clusters
    clusters = [[] for i in range(n)]
    for Pj in S:
        # assign s to the cluster that minimizes its cost wrt to the representative
        cluster_idx = argmin(
            cluster_repr,
            lambda Ri : cost_fn(Pj, Ri)
        )
        clusters[cluster_idx].append(Pj)

    # Step 2: Update step
    cluster_repr = [reduce_fn(clusters[i]) for i in range(n)]

    total_dist = None
    if old_clusters == clusters: break

Solution

  • Without some assumptions about those functions, this problem is NP-hard.

    Demonstration. Let's reduce 3DM to this. Take a 3-partite hypergraph with 3k nodes. Create a cost function cost(Pj, Ri) = ||Ri||. Create a reduce function that will map to a "representative" that has norm 1 for any triplet in the hypergraph, and otherwise will be very large. And now if there is a 3-matching, the optimal solution will be that 3-matching, and we've reduced a known NP-complete problem (in fact one of Karp's original 21) to this one.

    In general we don't know a way to verify that a solution is optimal in polynomial time. So we don't know that your problem is in P. Therefore the best that we know how to prove is that it is NP-hard.