python sql pandas database ansi-sql

How do I select the group with the least number of null values in a groupby?


Example:

row_number | id | firstname | middlename | lastname |
0          | 1  | John      | NULL       | Doe      |
1          | 1  | John      | Jacob      | Doe      |
2          | 2  | Alison    | Marie      | Smith    |
3          | 2  | NULL      | Marie      | Smith    |
4          | 2  | Alison    | Marie      | Smith    |

I'm trying to figure out how to group by id and then, for each group, keep the row with the fewest NULL values. If several rows tie for the fewest NULLs, dropping the extra rows is fine (for example, dropping row_number 4, since it ties with row_number 2 for the fewest NULLs where id = 2).

The answer for this example would be row_numbers 1 and 2.

ANSI SQL would be preferable, but I can translate from other languages (like Python with pandas) if you can think of a way to do it.

Edit: Added a row for the case of tie-breaking.


Solution

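  • If you want to do this in pandas, you can do it this way (rows that tie for the fewest NULLs within an id are all kept):

    # NC = number of NULLs per row; keep rows whose NC equals their group's minimum
    df[df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform(lambda x: x == x.min())]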
    

    Output (for the question's original four rows, before the tie-breaking row was added):

       row_number  id firstname middlename lastname
    1           1   1      John      Jacob      Doe
    2           2   2    Alison      Marie    Smith
    

    To break ties:

    Add a row:

    df.loc[4,['row_number','id','firstname','middlename','lastname']] = ['4',2,'Mary','Maxine','Maxwell']
    

    Then use groupby, transform, and idxmin, which returns the index of the first row with the minimum in each group:

    df[df.index == df.assign(NC=df.isnull().sum(axis=1)).groupby('id')['NC'].transform('idxmin')]
    

    Output:

      row_number id firstname middlename lastname
    1          1  1      John      Jacob      Doe
    2          2  2    Alison      Marie    Smith
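
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the five-row table from the question, applies both pandas variants, and then cross-checks the result with an ANSI-style ROW_NUMBER() query run through Python's built-in sqlite3 module. The table name `people` and the column name `row_num` are assumptions made for illustration (the column is renamed in the SQL part because ROW_NUMBER is a reserved word in standard SQL), and the query assumes an engine with window functions (SQLite 3.25 or later). Treat it as one possible way to do it, not the only one.

```python
import sqlite3

import pandas as pd

# Example rows from the question; None stands in for NULL.
rows = [
    (0, 1, "John", None, "Doe"),
    (1, 1, "John", "Jacob", "Doe"),
    (2, 2, "Alison", "Marie", "Smith"),
    (3, 2, None, "Marie", "Smith"),
    (4, 2, "Alison", "Marie", "Smith"),
]
columns = ["row_number", "id", "firstname", "middlename", "lastname"]
df = pd.DataFrame(rows, columns=columns)

# Number of NULLs in each row.
null_count = df.isnull().sum(axis=1)

# Variant 1: keep every row that ties for the fewest NULLs within its id.
with_ties = df[null_count == null_count.groupby(df["id"]).transform("min")]

# Variant 2: keep only the first such row per id
# (idxmin returns the index label of the first minimum in each group).
first_only = df.loc[null_count.groupby(df["id"]).idxmin()]

print(with_ties["row_number"].tolist())   # [1, 2, 4]
print(first_only["row_number"].tolist())  # [1, 2]

# SQL cross-check. "people" and "row_num" are hypothetical names.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE people ("
    "row_num INTEGER, id INTEGER, "
    "firstname TEXT, middlename TEXT, lastname TEXT)"
)
con.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?)", rows)

# Rank rows inside each id by NULL count (only the three name columns
# can be NULL here), using row_num as the tie-breaker, then keep rank 1.
query = """
SELECT row_num
FROM (
    SELECT row_num,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY
                   CASE WHEN firstname  IS NULL THEN 1 ELSE 0 END
                 + CASE WHEN middlename IS NULL THEN 1 ELSE 0 END
                 + CASE WHEN lastname   IS NULL THEN 1 ELSE 0 END,
                   row_num
           ) AS rn
    FROM people
) AS ranked
WHERE rn = 1
ORDER BY row_num
"""
sql_rows = [r[0] for r in con.execute(query)]
print(sql_rows)  # [1, 2]
con.close()
```

The SQL version spells out one CASE expression per nullable column, so it has to be extended by hand if the table has more columns; the pandas version counts NULLs across all columns automatically.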