I have been stuck on this problem and can't find an efficient solution.
I have N (up to 10 million) arrays of at most 100 elements each. The arrays contain numbers from 1 to 10,000.
My problem is to partition these arrays into K groups so as to minimize the duplicates across all the arrays. For example, for an array containing 1, 4, 10, 100 and another containing 1, 100, I would like them to go into the same group, because that minimizes duplication. My problem has two constraints:
1. I don't want the number of unique elements in a group to grow beyond 110. So if I have an array of size 100 and another array of size 100 that is a 60% match, I would rather create a new group, because merging them increases the number of unique elements to 140, and this would keep growing.
2. The number of arrays in each group should be roughly uniform across groups. (A sketch of how I score a grouping against the objective and these constraints follows this list.)
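To make the objective concrete, here is a minimal Python sketch of how I would score a candidate grouping. The names (`MAX_UNIQUE`, `duplication`, `feasible`) are mine, invented just for illustration; the idea is that each group stores the union of its arrays' elements, and duplication is whatever gets stored more than once across groups:

```python
from typing import List, Set

MAX_UNIQUE = 110  # constraint 1: cap on unique elements per group

def group_union(group: List[Set[int]]) -> Set[int]:
    """All distinct elements stored for one group."""
    u: Set[int] = set()
    for arr in group:
        u |= arr
    return u

def duplication(groups: List[List[Set[int]]]) -> int:
    """Elements stored in more than one group, counted once per extra
    group they appear in. Minimizing this is the objective."""
    per_group = [group_union(g) for g in groups]
    global_unique = set().union(*per_group)
    return sum(len(u) for u in per_group) - len(global_unique)

def feasible(group: List[Set[int]]) -> bool:
    """Constraint 1: a group may not exceed MAX_UNIQUE distinct elements."""
    return len(group_union(group)) <= MAX_UNIQUE
```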
What I have tried: sorting the arrays by size in decreasing order, deduplicating identical arrays via hashing, and then applying a greedy maximum-match algorithm under the constraints (sketched below). But the greedy doesn't seem to work well, because the outcome depends entirely on the partitions I pick first. I couldn't figure out how DP could be applied, because the number of combinations, given the total number of arrays, is just huge. I am not sure what methodology I should take.
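Roughly, the greedy looks like this (a simplified sketch; `MAX_UNIQUE` is the same cap as above, and the per-group array cap `group_cap` is my own approximation of constraint 2):

```python
from typing import List, Set

MAX_UNIQUE = 110  # same cap as above

def greedy_group(arrays: List[Set[int]], group_cap: int) -> List[List[Set[int]]]:
    """Greedy: biggest arrays first; put each array into the group with the
    largest overlap that stays under both caps, else open a new group."""
    # Deduplicate identical arrays (frozenset doubles as the hash key).
    seen = {frozenset(a) for a in arrays}
    work = sorted(seen, key=len, reverse=True)

    groups: List[List[Set[int]]] = []
    unions: List[Set[int]] = []   # cached union of elements per group
    for arr in work:
        best, best_overlap = -1, -1
        for i, u in enumerate(unions):
            if len(groups[i]) >= group_cap:
                continue                      # constraint 2 (uniform sizes)
            if len(u | arr) > MAX_UNIQUE:
                continue                      # constraint 1 (110 cap)
            overlap = len(u & arr)
            if overlap > best_overlap:
                best, best_overlap = i, overlap
        if best == -1:
            groups.append([set(arr)])
            unions.append(set(arr))
        else:
            groups[best].append(set(arr))
            unions[best] |= arr
    return groups
```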
Some of the failure cases of my algorithm: say there are two arrays that are mutually exclusive of each other, but if I formed a group out of them, a third array would match that group 100%. Instead, that third array matched just 30% in some other group and filled that group up. This increases my duplication, because the third array should have formed a group with the first two.
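A concrete instance of that case (the numbers are invented purely to illustrate):

```python
# A and B are mutually exclusive; C is exactly their union.
A = set(range(1, 51))      # 50 elements
B = set(range(51, 101))    # 50 elements, disjoint from A
C = A | B                  # 100 elements: a 100% match with the group {A, B}

# {A, B, C} has only 100 unique elements, so it fits under the 110 cap.
# But if the greedy commits C to a group where it overlaps only ~30% and
# that group then fills up, A and B can never join C, and C's elements
# end up duplicated across groups.
assert not (A & B) and C == A | B and len(A | B | C) == 100
```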
A simple but compute- and memory-intensive approach would be: for each array, iterate over all 10 million arrays to find the arrays with the maximum number of matching elements. Then store those match counts and match the resulting arrays against each other in the same way, iterating again, with the criterion that a match should be at least 60%.
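In code, that naive pass would look something like the following. It is O(N²) set intersections, which is exactly what makes it infeasible at N = 10 million; measuring the 60% threshold against the smaller of the two arrays is my assumption:

```python
from typing import List, Set, Tuple

def pairwise_matches(arrays: List[Set[int]],
                     threshold: float = 0.6) -> List[Tuple[int, int, int]]:
    """Naive O(N^2) pass: for every pair, count common elements and keep
    pairs whose overlap covers at least `threshold` of the smaller array."""
    out = []
    for i in range(len(arrays)):
        for j in range(i + 1, len(arrays)):
            common = len(arrays[i] & arrays[j])
            if common >= threshold * min(len(arrays[i]), len(arrays[j])):
                out.append((i, j, common))
    return out
```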