data-structures b-tree

How much do B-trees reduce disk accesses?


I have just read about the B-tree data structure and I have some questions. There is one doubt in my mind that none of the blog posts I have read explain (maybe it's too obvious and I am missing it).

B-trees are supposed to reduce disk accesses by reducing the height of the tree. So, if reducing the number of disk accesses is the main concern, how much difference does it actually make? Suppose I just use a binary tree: its nodes need less space than the nodes of an n-ary B-tree, so I can fit more of them into a page than I can fit fat B-tree nodes. How does this affect disk accesses exactly? Are we talking about the worst case only?


Solution

  • You have to understand that B-trees are usually used in systems where you have paged data access. This is most commonly a database system. A page is essentially a block of memory which you have to read (and write) at once. You cannot read individual parts of the page without reading the whole page.

    The important thing is: Reading pages from disk into memory is expensive; way more expensive than doing whatever with a page that is already in memory. As such, you want to minimize the number of pages you have to read.

    B-trees have several benefits over binary trees in this setting, which is hardly surprising, considering that they were designed specifically for it.

    One of these benefits is a reduced height. You can search a normal binary tree quickly, but while doing so you walk very deep into the tree: a balanced binary tree with 1 million elements already has a depth of about 20, so you need to walk down 20 nodes. Compared with that, the height of a B-tree is a lot lower. With a configured children count of 10 (which would be very low, by the way) the height is already down to approximately 6. So we visit far fewer nodes, and therefore load far fewer pages. Usually, the order of a B-tree (that is, the number of children each node has) is chosen so that a single node fills a complete page. Now that may sound stupid, as you then need to search within that node's keys, but that search happens in memory and is cheap, while the reduced depth heavily cuts the number of pages you have to read (the two short sketches below put numbers on this and show how a lookup touches one page per level).

    Another benefit is that B-trees are balanced. This ensures that all nodes are, at all times, filled with approximately the same number of children; often this is about 75% of their capacity. Since a node fills a complete page, every page holding a node is filled to roughly that fraction of its capacity. That is very good because it makes efficient use of the space occupied by nodes and avoids pages that are mostly holes containing no information (this would be a big issue with plain binary trees, since they are not balanced by design). Another very important consequence is that the number of operations (and with it the run time) needed to find an element is consistent, so you get very predictable performance in all cases. For databases, this is usually far more important than having better best or average cases whose performance may vary.

    There are other benefits too, such as the leaves not only all being at the same level but also being physically located close to each other, which improves seek time when iterating through elements.

    Basically, B-trees are optimized for paged data access, which makes them very specialized and fine-tuned for that purpose, allowing them to outperform classic binary trees there (which in turn are simpler and more efficient in many other applications).
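
As a rough back-of-the-envelope check of the height numbers above, here is a small Python sketch. The 1 million keys, the order of 10, and the ~75% fill factor are the figures from the answer; the 4 KiB page size and the 16 bytes per entry (key plus child pointer) used to derive a "realistic" order are assumptions added for illustration.

```python
# Minimal sketch: how many node visits (and hence page reads) does a lookup
# need in a balanced tree with a given fanout?
def height(n, fanout):
    """Smallest h such that fanout**h >= n, i.e. ceil(log_fanout(n))."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= fanout
        levels += 1
    return levels

N = 1_000_000

print(height(N, 2))      # balanced binary tree: 20 levels
print(height(N, 10))     # B-tree with the (very low) order 10: 6 levels

# Assumed figures: a 4 KiB page and 16 bytes per entry (8-byte key plus
# 8-byte child pointer) give an order of about 256.
order = 4096 // 16
print(height(N, order))  # about 3 levels

# Nodes that are only ~75% full barely change the picture, which is part of
# why the performance stays so predictable.
print(height(N, int(order * 0.75)))   # still 3 levels
```

Even with conservative assumptions, a lookup drops from about 20 page reads to about 3, which is the whole point of matching the node size to the page size.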
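And here is a minimal, purely illustrative sketch (not code from the answer, and without the bookkeeping a real implementation needs) of why only the descent costs page reads: the search within a node's sorted keys happens in memory, while each step down to a child is one more page to load. The `Node` class and `search` function are hypothetical names made up for this sketch.

```python
from bisect import bisect_left

class Node:
    """A B-tree node: sorted keys, plus child pointers unless it is a leaf."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None for a leaf

def search(root, key):
    """Return (found, page_reads); each node visited counts as one page read."""
    page_reads = 0
    node = root
    while node is not None:
        page_reads += 1                      # loading this node = one page read
        i = bisect_left(node.keys, key)      # cheap in-memory search in the node
        if i < len(node.keys) and node.keys[i] == key:
            return True, page_reads
        if node.children is None:            # hit a leaf without finding the key
            return False, page_reads
        node = node.children[i]              # descend: the next page to load
    return False, page_reads

# Tiny example tree (two keys per node), just to show the counting.
leaves = [Node([1, 5]), Node([12, 18]), Node([25, 40])]
root = Node([10, 20], leaves)
print(search(root, 18))   # (True, 2): the lookup touched two pages
print(search(root, 7))    # (False, 2)
```

The total number of key comparisons is similar to a binary tree, but the number of pages loaded equals the height, and that is what dominates the cost.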