InnoDB B+Tree

Is the InnoDB clustered index always bigger than the data size?


In InnoDB, the clustered index B+Tree stores the data in its leaf nodes. I believe it stores the actual data records, not pointers to the data. This is different from Postgres, where an index leaf node stores a pointer into the heap file.

This has some serious implications.

  1. The InnoDB index file contains all the records, so it should always be bigger than the actual data size.
  2. Very likely the InnoDB index file can't fit in memory, e.g. 100GB of data with 8GB of RAM. If the clustered index contains the data in its leaf nodes, the index can't fit in memory.

Are these true?

Coming from Postgres, where the index file doesn't store the data, the index should be much smaller and hence more likely to fit in memory. But in InnoDB, the clustered index most likely can't fit in memory, since the index is essentially the table.

Some sources agree. From one link: under MySQL's InnoDB, the table data and the primary key index are stored in the same data structure.

Another link: https://dba.stackexchange.com/questions/41232/performance-difference-between-mysql-and-postgresql-for-the-same-schema-queries/49062#49062


Solution

  • Those are sort of true.

    The B+Tree contains all the data; that is, all the columns, including the PRIMARY KEY column(s). It is sorted by the PK; the blocks are 16KB each. So, typically, each block is not entirely full.

    And, since there is a lot of "overhead" stuff, an estimate of the bytes needed for the raw data should be multiplied by 2 to 3 to get the disk space needed for all the blocks.

    A "point query" by the PK is a straightforward BTree lookup and very fast. A billion rows might have about 5 levels of BTree.

    The blocks of data, secondary indexes, and other miscellany are cached in RAM in the "buffer pool", whose size is configured by innodb_buffer_pool_size. For 8GB of RAM, I recommend no more than 6G.
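    For example, in my.cnf that recommendation would look like this (a config sketch; tune the exact value to your workload):

    ```ini
    [mysqld]
    # Cache for data and index blocks; leave headroom for the OS
    # and MySQL's other buffers on an 8GB machine.
    innodb_buffer_pool_size = 6G
    ```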

    The overhead for having the PK embedded with the data is about 1%. (Hence, I quibble with your "bigger than actual data size".) That 1% is a crude estimate of the extra "index" blocks needed in addition to the data blocks that, of course, must exist.

    As a cache, the blocks come and go as needed. So, your 100GB of data will work even with only 5GB of cache. The downside is the extra I/O. Depending on the access patterns, there may not be much downside.

    MySQL does not attempt to "keep the index in RAM" (or anything like that). It only shuffles blocks in/out in roughly a least-recently-used pattern. If your access pattern does not involve a table scan or index scan, I suggest that "index can't fit in memory" is not a big concern.
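    A toy model of that block-shuffling behavior (a plain LRU cache; InnoDB's real buffer pool uses a midpoint-insertion variant of LRU, so this is only an approximation):

    ```python
    from collections import OrderedDict

    class BufferPool:
        """Toy LRU cache of fixed block capacity, counting disk reads."""
        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.blocks = OrderedDict()   # block_id -> cached flag
            self.disk_reads = 0

        def fetch(self, block_id):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)    # hit: mark recently used
            else:
                self.disk_reads += 1                 # miss: one I/O
                self.blocks[block_id] = True
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # evict least recently used

    pool = BufferPool(capacity_blocks=3)
    for b in ["root", "internal", "leaf1", "root", "internal", "leaf2"]:
        pool.fetch(b)
    # Hot non-leaf blocks stay cached; only the new leaf costs an I/O.
    print(pool.disk_reads)  # → 4
    ```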

    More

    Rephrasing... The Clustered index (PK) and the data "coexist" in the same B+Tree. They share the PK column(s); the clustered index contributes ordering (and B+Tree overhead); the "data" provides the rest of the columns.

    True, the clustered index is not separable. Meanwhile, each secondary index is in a separate B+Tree structure containing

    * the column(s) of the secondary index
    * the column(s) of the PK (so a lookup can get to the rest of the columns)
    

    All those B+Trees live in a single .ibd file on disk.

    MySQL "never" uses Hashing. (There are a few obscure situations, mostly internal.)

    All blocks (data/index; leaf/non-leaf) are cached in the buffer_pool. Non-leaf index blocks (including the PK) tend to stay cached; leaf nodes tend to get flushed out. But there is no rule; it depends on the queries. So, again, I say "sort of true". That is, random point queries via the PK for a table of your size will usually need 1 I/O (namely for the leaf node).

    A point query via a secondary index must first do like above in its B+Tree, then do another lookup via the PK to get the rest of the columns. (A "covering" index is an exception.)

    It can be useful to think of a secondary index as being identical in structure to the data's B+Tree, just with different contents.
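    A minimal sketch of those contents and the double lookup they imply (plain Python dicts stand in for the two B+Trees; real B+Trees are sorted and range-scannable, which dicts are not, and the column names here are made up):

    ```python
    # Clustered index: PK -> all columns of the row.
    clustered = {
        101: {"id": 101, "email": "a@x.com", "name": "Ann"},
        102: {"id": 102, "email": "b@x.com", "name": "Bob"},
    }
    # Secondary index on `email`: indexed column(s) -> PK column(s).
    email_idx = {"a@x.com": 101, "b@x.com": 102}

    def point_query_by_email(email):
        pk = email_idx[email]        # lookup 1: secondary B+Tree
        return clustered[pk]         # lookup 2: clustered B+Tree, by PK

    def covering_query_by_email(email):
        # A "covering" index already holds everything the query needs
        # (here: email and id), so the second lookup is skipped.
        return {"email": email, "id": email_idx[email]}

    print(point_query_by_email("b@x.com")["name"])  # → Bob
    ```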

    PS. MySQL's defunct ENGINE=MyISAM smells a lot like Postgres's PK, except that the link from any index (clustered or secondary) is not via a hash; instead it is an offset into the .MYD file, which it follows with fseek.

    Different queries run at different speeds depending on InnoDB vs MyISAM vs Postgres vs etc. There is no universally-best way to implement an RDBMS. InnoDB does a very good job.