postgresqlindexingpostgresql-performance

Why is Bitmap Scan faster than Index Scan when fetching a moderately large percentage of the table in PostgreSQL?


The author of Bitmap Scan described the difference between Bitmap Heap Scan and Index Scan:

A plain indexscan fetches one tuple-pointer at a time from the index, and immediately visits that tuple in the table. A bitmap scan fetches all the tuple-pointers from the index in one go, sorts them using an in-memory "bitmap" data structure, and then visits the table tuples in physical tuple-location order. The bitmap scan improves locality of reference to the table at the cost of more bookkeeping overhead to manage the "bitmap" data structure --- and at the cost that the data is no longer retrieved in index order, which doesn't matter for your query but would matter if you said ORDER BY.

Questions:

  1. Why does it sort the fetched tuple-pointers again, when the index is already sorted?

  2. How does it sort with bitmap? I know what bitmap is, but I don't understand how it can be used for sorting.

  3. Why is it faster than Index Scan when fetching a moderately large percentage of the table? On the contrary, it seems to add quite a bit of computing to the process.


Solution

  • The backbone of Postgres storage is made up of data pages of 8 kilobytes in a typical installation. Each data page typically holds many tuples. Read the details of physical storage in the manual.

    The "bitmap" in a bitmap scan is a way to collect tuple pointers in buckets of data pages. Index sort order is necessarily lost in this process in favor of physical sort order. In "lossy mode" (which only occurs if the result is so huge or workmem so small that even the tiny bitmap won't fit) only block numbers are kept and corresponding tuple indexes discarded.

    After that, each data page is only visited once from storage to extract (possibly) multiple tuples and in physical sequence, which also matters for some types of storage. In lossy mode, tuples from each identified page have to be filtered by rechecking the index condition; else tuples can be retrieved directly using collected tuple indexes.

    In an index scan, each page may have to be visited multiple times if multiple tuples end up to be stored in the same data page. The actual process is even more sophisticated. Related:

    To your questions:

    1. The sorting of the index is lost due to collecting hits and reading them out data page by data page.

    2. Hence, the result has to be sorted again, in an added sort step - if the sort order is required (by ORDER BY, for instance).

    3. The number of data pages that have to be read is the most prominent factor for overall performance. Bitmap index scans reduce that number to a minimum. With faster storage, the benefit of bitmap index scans gets smaller. That's why accurate cost settings are crucial for the query planner to make good decisions.