sqlt-sql

SQL query to select fewest rows that cover all columns


I'm looking to get column coverage with the fewest amount of rows possible. The table is a list of products with quite a few categorisation columns. I want the rows to cover all the columns that have a 1 in them (some columns may from time to time have no 1s in), all rows and columns will either contain a 1 or 0 value.

ID cat1 cat2 cat3 cat4
123 1 1 0 0
324 0 0 0 0
555 1 0 0 0
665 0 1 1 0
4455 0 1 1 0
6756 0 0 1 0
6563 0 0 0 1
22222 0 1 1 1

The table at present, is roughly 15k rows and 35 columns. When I say the fewest, it doesn't have to be the absolute fewest, if we have a row or 2 extra on top of optimal I'm ok with that. The output from the table above would be:

ID cat1 cat2 cat3 cat4
123 1 1 0 0
22222 0 1 1 1

the variant of SQL i'm using is T-SQL.

My thinking was to take the first row that covers the most columns and then loop through on the remaining columns required with the remaining rows and repeat until all columns are accounted for. columns with no 1 value in all the rows would be removed prior to the process.

I've tried using CTEs, but it gets very long and it's not dynamic, so is not going to be very repeatable.

I've loaded an example of the table here if it's of any use: https://tmpfiles.org/21721172/product_category_binary_file.csv

I have loaded some of the data into db-fiddle here https://www.db-fiddle.com/f/3wSz4fXaecU4gbrkQnG5da/21


Solution

  • Pivoting and computing max's per category

    Instead of iteratively or recursively computing row by row or category by category,
    let's imagine we want to select, for each category, the "wealthiest" ids it reaches:

    This will mechanically discard IDs with less categories.
    Particularly, if B's categories are a subset of A's (let's say B has categories 1 and 2, and A has 1, 2, 3), for each category of B, A has the same category (thus they are in competition to be elected for that category), but as A has 1 more category (is wealthiest), it will take over B in the competition for this category.
    Thus after having tried every category, B can claim none:
    only A appears in the results.

    Now let's transpose this reasoning to SQL:

    with
        cats as
        (
            select id, cat
            from t unpivot (val for cat in (f_prodImage,f_award,f_awards,f_ballot /* list all category columns of t here */)) p
            where val = 1
        ),
        -- For each id, count how many categories it covers.
        n as (select id, count(1) n from cats group by id),
        -- For each category, find which ids (having this category) have the maximum count of categories.
        catmax as
        (
            select cat, max(n) nmax
            from cats join n on n.id = cats.id
            group by cat
        ),
        -- Now select every cat / id tuple, for ids that are a maximum for this category.
        maxs as
        (
            select n.id, string_agg(catmax.cat, ', ') within group (order by catmax.cat) maxforcat
            from catmax join cats on cats.cat = catmax.cat join n on n.id = cats.id and n.n = catmax.nmax
            group by n.id
        )
    -- Now just select entries 
    select *
    from t join maxs on maxs.id = t.id
    order by len(maxforcat) desc, maxforcat;
    

    Which can be seen on a small test dataset
    or on a big, complete one.