databaseoptimizationkdbinsertion

Insertion operation complexity on KDB table with sorted attribute


Let there be a KDB table with columns A,B and C AND it is sorted on column A. I wanted to understand the complexity of insertion of records to this table (given that it has to keep the table sorted on A).

  1. would it help (in terms of complexity) if the insertions to this table are guranteed to be in sorted order of A. Which means that at any time t2>t1, A(t2)>A(t1) ?
  2. is there a way i can exploit the above fact (t2>t1 => A(t2)>A(t1)) and optimize the queries without even applying the sorted attribute on A ?
  3. I know there is one way to perform binary search on a column, but I mainly wanted to know if there is a way in which I can tell the query planner to "assume" a column is sorted (without actually having sorted attribute on it because i want to avoid insertion complexity associated with it) and execute the query accordingly ?

Solution

  • My thoughts (some of which are just opinion as we can't see exactly what kdb does under the covers):

    A. To clarify - kdb itself doesn't "keep the table sorted". Kdb will insert the data regardless, it's up to the user to ensure the table remains sorted.

    B. I don't think you should be worried about overhead/complexity in kdb inserts - I would estimate that insert is (one of) the single most optimised operation in all of kdb

    C. Whether the column has an attribute or not, kdb will do the insert either way and possibly only after the insert does it check to see if the attribute was retained or not. This would be a highly optimised check. s# will be lost on a non-sorted inserted. u# will be lost on a non-unique insert. p# will be lost on any insert as it's generally intended to be used on static/on-disk data.

    D. The only case where there would be a non-negligable cost/complexity of insert would be in the case of maintaining a grouped attribute since g# is always retained on insert and there's the overhead of updating the hidden hash table. But even then, this overhead doesn't impact high-volume RDBs which have billions of inserts in a given day.

    None of these are actual hard numbers or big O/complexity information but in my experience the big O/complexity is more relevant to the lookup of attribute-having data rather than the insert/maintenance of the attribute/data. In my experience the insert has never been a concern.

    To answer your actual questions:

    1. As I eluded to in (A), if you want to have a sorted attribute and you want to retain it then you must ensure the data is inserted in sorted order

    2. If there's no attribute then kdb treats a column/vector like any other vector - it would scan the entire vector each time since there is no flag/attribute telling it to use an optimisation. The only exception to this is with as-of joins (or window joins) aj/wj where an aj on say `sym`time assumes that time is sorted within sym, without there being an explicit s# attribute on time.

    3. Apart from the aj/wj exceptions above, no if you want to take advantage of the sorted nature of the data to speed up queries then you need to have an s# attribute on it. Unless of course you use a different attribute such as p# which as I mentioned earlier has its own caveats