kdb

(q/kdb+) How to group data with slight date discrepancies (+/-1 day) for unique identifier creation


I am trying to create uniqueid column for each child (as they appear more than once in the table) so that I can identify between children with same name (no child has the same dadname, mumname and childdob)

q)group select dadname, mumname, childdob from peopletb
dadname         mumname            childdob   |                  
----------------------------------------------| -----------------
simon           jessica             2007.03.27| 0 1658 2292 18295
mike            rachel              2007.02.06| 1 4293 10976 33439 53751 63077 81844 90244 98782 ..
james           lucy                2005.04.09| 2 12756 21354
francis         anne                2007.04.12| 3 4151 8252 159075 172735 175531 177199 206953 40..
francis         anne                2007.04.13| 7 3255 7292 128021 143595 157839 168120 186578 21..
mike            rachel              2007.02.05| 10 4292 10946 33339 56751 67077 82844 96244 99381 ..

and so on

The issue is, is that there are some slight discrepancies in the childdob where it has been incremented by one day or the opposite it has gone back one day.

when grouping the data i want it to group childdob +1 or -1 from each other so that the discrepancy is eliminated.

e.g.

mike            rachel              2007.02.05| 10 4292 10946 33339 56751 67077 82844 96244 99381 ..
&
mike            rachel              2007.02.06| 1 4293 10976 33439 53751 63077 81844 90244 98782 ..

would have the same uniqueid

as would

francis         anne                2007.04.12| 3 4151 8252 159075 172735 175531 177199 206953 40..
&
francis         anne                2007.04.13| 7 3255 7292 128021 143595 157839 168120 186578 21..

Solution

  • There's a few tricky edge cases when you don't know the true dob, and if you have datapoints both above and below, but something like this might be a starting point:

    peopletb:([]dadname:100?`simon`mike`james;mumname:100?`jessica`rachel`lucy;childdob:100?2007.01.03 2007.01.02 2007.01.04 2007.06.01 2007.08.10 2007.08.09 2007.10.23 2007.10.24)
    
    /default grouping
    q)select idx:i by dadname,mumname,childdob from `childdob xasc peopletb
    dadname mumname childdob  | idx
    --------------------------| --------------
    james   jessica 2007.01.02| ,9
    james   jessica 2007.01.04| ,26
    james   jessica 2007.08.09| 56 58 59
    james   jessica 2007.08.10| 75 77
    james   jessica 2007.10.24| ,96
    james   lucy    2007.01.02| ,3
    james   lucy    2007.01.03| 13 21 24
    ...
    
    /step 1 - fudge the date
    q)fudge:update childdob:{(y;x)2>y-x}\[childdob]by dadname,mumname from `childdob xasc peopletb;
    /step 2 - group the fudged dates
    q)select idx:i by dadname,mumname,childdob from fudge
    dadname mumname childdob  | idx
    --------------------------| --------------------------
    james   jessica 2007.01.02| ,9
    james   jessica 2007.01.04| ,26
    james   jessica 2007.08.09| 56 58 59 75 77
    james   jessica 2007.10.24| ,96
    james   lucy    2007.01.02| 3 13 21 24
    james   lucy    2007.01.04| ,34
    james   lucy    2007.08.09| 50 51 55 60 65 66 70 76 78
    ...
    

    Notice how james/jessica combined 2007.08.09/10 into 09 but didn't combine 2007.01.02/04 (ambiguous, doesn't satisfy 2>y-x). However you could increase the fudge to 3>y-x for more leniency