data-structuresb-treesearch-tree

searching within a Node on a CSB+ tree


I'm reading the paper, making B+-trees cache conscious in main memory. In Section 3.1.2, authors describe several approaches to searching within a CSB+ tree node.

Tha basic approach is to simply do a binary search using a conventional while loop.

The uniform approach is through code expansion, unfolding the while loop into if-then-else statements assuming all the keys are used.

Authors give the following example which exhibits an unfolding of the search for a node with up to 9 keys. The number in a node represents the position of the key being used in an if test

              4
            /   \
           2     6
          / \   / \
         1   3 5   8
                  / \
                 7   9

Then comes the confusing part:

If only 5 keys were actually present, we could traverse this tree with exactly 3 comparisons. On the other hand, an unfolding that put the deepest subtree at the left instead of the right would need 4 comparisons on some branches.

So why would it need more comparisons in the following tree:

              6
            /   \
           4     8
          / \   / \
         2   5 7   9
        / \
       1   3

Furthermore,

if we knew we had only five valid keys, we could hardcode a tree that, on average, used 2.67 comparisons rather than 3.

How does 2.67 come about?

Any hints would be appreciated. Also, a link directing me to code expansion knowledge would be helpful.

Actually, I'm not sure whether it's appropriate to ask a question on a paper because some key information may have been left out when transcribed here (the question may need reformatting). I just wish there could be someone who happens to have read the paper.

Thanks


Solution

  • The key point here is the following quotation from the same section:

    we pad all the unused keys (keyList[nKeys..2d-1]) in a node with the largest possible key

    Also is is important that in B+/CSB+ tree we search not for node values, but for intervals between these values. A set of possible values is split by 5 keys into 6 intervals.

    Since most of the right sub-tree is filled with the largest possible key (L), we need less comparisons than usual:

                  4
                /   \
               2     L
              / \   / \
             1   3 5   L
                      / \
                     L   L
    

    Right descendant of root node has largest possible key, there is no need to check any node to the right of it. And exactly 3 comparisons are needed for every interval:

    interval   comparisons
    up to 1    k>4, k>2, k>1
     1..2      k>4, k>2, k>1
     2..3      k>4, k>2, k>3
     3..4      k>4, k>2, k>3
     4..5      k>4, k>L, k>5
     5..L      k>4, k>L, k>5
    

    If we put the deepest subtree at the left, we have a tree, where non-empty nodes are placed one level deeper:

                  L
                /   \
               4     L
              / \   / \
             2   5 L   L
            / \
           1   3
    

    Search for node "1" in this tree requires to compare the key with 4 different nodes: L, 4, 2, and 1.

    If we know we have only five valid keys, we have the following tree:

                  2
                /   \
               1     4
                    / \
                   3   5
    

    Here we can arrange comparisons in a way, that gives 2.67 comparisons on average:

    interval   comparisons
    up to 1    k>2, k>1
    1..2       k>2, k>1
    2..3       k>2, k>4, k>3
    3..4       k>2, k>4, k>3
    4..5       k>2, k>4, k>5
    5..L       k>2, k>4, k>5
    

    "Code expansion" is not a widely used term, so I cannot give you the most relevant link. I think, this is not very different from "Loop unwinding".