postgresqlindexingb-treegist-index

Is the relationship between index tuple in GiST index and user table row many to one or one to one?


In a regular b-tree index, the leaf node contains a key and a pointer to the heap tuple (user table row), which signifies that in b-tree, the relationship between index tuple and user table row is one-to-one.

Just like in a b-tree, a GiST leaf node also contains a key datum and info about where the heap tuple is stored, but GiST leaves may or may not contain entire row data in its keys (please correct me if I'm wrong). So, if I am able to store one part of my table data in one leaf node and the other part in another leaf node and make both of them point to one heap tuple, would it be possible? This will make the relationship between GiST index tuple and heap tuple many to one.

Is all this correct?


Solution

  • A GiST index is a generalization of a B-tree index.

    In a non-leaf block of a B-tree index, two consecutive index entries define the boundary for the indexed values in the subtree at the destination of the pointer between these index entries:

    B-tree index

    In other words, each pointer to the next lower level is labeled with an interval that contains all values in the subtree.

    This only works for data types with a total ordering.

    The GiST index extends that concept. Each entry in a non-leaf node has a condition that the subtree under that index entry has to satisfy.

    When scanning a GiST index, I search the index page for all entries that may contain values matching my search condition. Since there is no total ordering, it is possible (but of course not desirable) that the conditions somehow “overlap” so that something I search for can have matches in more than one of the entries. In that case I have to descend into all the referenced subtrees, but I can skip those where the entry's condition guarantees that the subtree cannot contain entries that match my search condition.

    This is a little abstract, so let's flesh it out with an example.

    One of the classical examples of a GiST index is an R-tree index, a kind of geographical index like it is used by PostGIS:

    R-traa index

    Here the condition of an index entry is a bounding box that contains the bounding boxes of all geometries contained in the subtree of the index entry. So whan searching for a geometry, I take its bounding box and see which of the index entries in a page contains this bounding box. These are the subtrees into which I have to descend.

    One thing that can be seen in this example is that a GiST index can be lossy, that is, it gives me a neccesary, but not sufficient condition if I have found a hit. The leaf entries found in a GiST index scan always have to be rechecked if the actual table entry also satisfies the condition (not every geometry is a rectangle). This is why a GiST index scan is always a bitmap index scan in PostgreSQL.

    This all sounds nice and simple. The hard part about a good GiST index is the picksplit algorithm that decides upon an index page split which of the index entries comes into which of the two new index pages. The better this works, the more efficient the index will be.

    So you see, a GiST index is “somewhat like” a B-tree index in many respects. You can see a B-tree index as an optimized special case of a GiST index (see the btree-gist contrib module).

    This lets me answer your questions:

    GiST leaf node also contains key datum and info about where the heap tuple is stored

    This is true.

    GiST leaves may or may not contain entire row data in its keys

    Of course the index entry does not contain the entire row. But I think you mean the right thing. The condition in the GiST leaf can be broader than the actual object in the table, like a bounding box is bigger than a geometry.

    if I am able to store one part of my table data in one leaf node and the other part in another leaf node and make both of them point to one heap tuple, would it be possible? This will make the relationship between GiST index tuple and heap tuple many to one.

    This is wrong. Even though a value may satisfy several of the entries in a GiST index page, it is only contained in one of the subtrees, and only one leaf page entry points to any given table row. It is a one-to-one relationship, just like in a B-tree index.