kdbqpythonrethinkdb-pythonqpython3

Join columns to a table in parallel


Currently I use a function to run an aj join for a single column from a large table on to a smaller table which has its time column shifted t milliseconds ahead, joining on a sym column as well as time. I then compute and programatically name a new column based on this joined column, before deleting the original joined column from the small table. This returns the small table with a new column based on values joined from the larger table t milliseconds ahead.

I then use an Over loop / to repeat this over a list of different delays t, recursively adding one new column for each delay in the list, passing as an argument the table so columns are added recursively.

My issue is the query, join and processing are slow on a large table. I have many cores so I would like to parallelise this operation to take advantage of all available cores, as well as optimising the steps taken to add the new columns. The large table is partitioned on disk by date and sym.

[Edit:] Here is an example of what I have at the moment.

smallT: ([] sym: (20#`AAPL),(20#`MSFT); time: (asc 00:00:00+20?til 100), (asc 00:00:00+20?til 100));

bigT: ([] sym: (100#`AAPL),(100#`MSFT); time: (asc 00:00:00+til 100), (asc 00:00:00+til 100); price: (til 100),(til 100));

delays: 00:00:00 + (7 * til 5);

foo: ([bigTab; smallTab2; delays]

    smallTab2: aj[ `sym`time; `sym`time xasc select from (update time:time+delays from smallTab2); `sym`time xasc select sym, time, future pricesprice from bigTabl;

    smallTab2: ![smallTab2; (); 0b; enlist[$"colnametime_", string(`int$delays)] ! enlist(%;`future_price;100)];

    delete future_price from smallTab2

}[bigT];

smallT:foo/[select from smallT; delays];

smallT

I am relatively new to q and kdb so verbose explanations of how and why a solution works with working code on a toy example would be very greatly appreciated.


Solution

  • Your function is repeatedly sorting the tables inside the loop which will slow you down.

    Also as noted in documentation attributes should be applied on the tables which will greatly improve performance. https://code.kx.com/q/ref/aj/#performance

    Rather than looping through the delay offsets you can instead create the full list and only aj once. https://community.kx.com/t5/New-kdb-q-users-question-forum/How-do-you-start-thinking-in-vectors/td-p/12722

    smallT: ([] sym: (20#`AAPL),(20#`MSFT); time: (asc 00:00:00+20?til 100), (asc 00:00:00+20?til 100));
    bigT: ([] sym: (100#`AAPL),(100#`MSFT); time: (asc 00:00:00+til 100), (asc 00:00:00+til 100); price: (til 100),(til 100));
    delays: 00:00:00 + (7 * til 5);
    
    bigT:update `p#sym from `sym`time xasc bigT
    
    res:raze {[x;y] update row:i, delay:(`$"delay_",string`int$y),time+y from x}[smallT] each delays
    res:update `g#sym from `sym`time xasc res
    
    res:aj[ `sym`time;res; select sym, time, price from bigT]
    
    delete row from `sym`time xcols 0!(exec ({`$"delay_",string`int$x} each delays)#(delay!price) by row:row from res) lj 1!select row,sym,time from res
    sym  time     delay_0 delay_7 delay_14 delay_21 delay_28
    --------------------------------------------------------
    AAPL 00:00:17 17      24      31       38       45      
    AAPL 00:00:18 18      25      32       39       46      
    AAPL 00:00:18 18      25      32       39       46      
    AAPL 00:00:28 28      35      42       49       56      
    AAPL 00:00:33 33      40      47       54       61       
    ...