pythondataframesample

Dataframe - Select the minimum set of rows to cover all possible values of each columns


I'm working on a use-case on which I need to retrieve a minimal sample of rows from a dataframe that contains at least one row for each unique value found in all columns.

A simplified example could be :

ID A B C
1 D G X
2 D G Y
3 E G Y
4 E H Y
5 F I Z

Here I would like to keep rows with ID 1, 4 and 5, in order to have at least one row with values D, E and F from A; G, H and I from B and X, Y and Z from C. I don't need to have all combinations, only all uniques values from every column:

ID A B C
1 D G X
4 E H Y
5 F I Z

What could be an efficient way to do this?

Thank you


Solution

  • Here is what I ended up going with:

    Far from the optimal solution as I'm only minimizing the number of rows on each step, but it's pretty efficient and it gets the job done. Could have probably gone with a recursive algorithm to test all solutions but I don't have a big requirement on the final set of rows, except it needed all possible values.

    Thank you all for inputs, they helped me come up to this solution.