Tags: sql, oracle, query-optimization

Optimizing Oracle query join on a range of values (join on n between x and y)


I have two sizable (~1M rows each) Oracle tables that I need to join on a range of integers. For example,

Range_Descriptors
-----------------
range_id
from_value number(38)
to_value number(38)
descriptor

Values
------
value_id 
value number(38)
...

And so in this case I need to get the Range_Descriptors.descriptor for each record in Values where value falls between from_value and to_value.

In Range_Descriptors, the ranges are non-overlapping, and a very large percentage of the ranges cover only one or two values (i.e. to_value = from_value or to_value = from_value + 1), but even excluding those special cases only helps somewhat.

The most obvious query does return the expected results; however, it runs extremely slowly (~20 minutes):

select v.value_id, r.descriptor
from Values v join Range_Descriptors r on v.value between r.from_value and r.to_value;

I've played around with different indexing options, but I haven't found anything that significantly improves performance. Currently there are indexes on Range_Descriptors(from_value, to_value) and on Values(value). Suggestions?


Solution

  • This is a nasty problem with range keys (start/end dates, from/to values, etc.). While Oracle can use an index scan to eliminate one side of your range (greater than or less than), that leaves the other side open-ended, and therein lies the problem.

    There are two approaches I use for this situation depending on data volume.

    1. For smaller tables (1M rows is actually rather small), or where you are going to look up only a few specific values, simply create a concatenated (two-column) index on (from_value, to_value), and ensure your queries are doing a range scan on it (hint them if they aren't). This will avoid hitting the table for any row that is out of range. However, it still requires a linked-list, single-block scan of the index leaf nodes for all the ranges that precede your target value - that's because the binary search performed on the first column is satisfied by the lowest from_value, and continues to be satisfied all the way up until it gets past the target value. So you end up doing a lot of index block gets.
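
       Against your tables, a minimal sketch of this approach might look like the following (rd_range_ix is just a placeholder name for the index, and the hint assumes the optimizer isn't already choosing the range scan on its own):

          -- Two-column index: out-of-range rows are eliminated in the
          -- index without touching the table
          CREATE INDEX rd_range_ix ON Range_Descriptors (from_value, to_value);

          SELECT /*+ INDEX(r rd_range_ix) */
                 v.value_id, r.descriptor
            FROM Values v
            JOIN Range_Descriptors r
              ON v.value BETWEEN r.from_value AND r.to_value;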

    2. For larger tables (hundreds of millions or billions of rows), and when you are going to process most of the values (as in perhaps an ETL process, a materialized view, or some daily job), you may need another strategy. What works well for me is to bucketize the range table based on the size of the range, so that you can use an equality predicate in your joins for most rows and minimize the rows that need an inequality. There are several ways to do that. Here's one from a system that historicizes using start/end IDs (which are elsewhere tied to timestamps). I often need to get the row that was active as of a specific snapshot ID. When the volume is truly massive, it requires some creativity:

         SELECT /*+ NO_MERGE USE_HASH(dt rt) */
                dt.driving_start_snapshot_id,
                rt.*
           FROM drivingtable dt,
                rangetable rt
          WHERE FLOOR(rt.composite_start_snapshot_id / 1000000) = FLOOR(rt.composite_end_snapshot_id / 1000000)
            AND FLOOR(dt.driving_start_snapshot_id / 1000000) = FLOOR(rt.composite_start_snapshot_id / 1000000)
            AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
            AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
         UNION ALL
         SELECT /*+ NO_MERGE USE_HASH(dt rt) */
                dt.driving_start_snapshot_id,
                rt.*
           FROM drivingtable dt,
                rangetable rt
          WHERE FLOOR(rt.composite_start_snapshot_id / 1000000) != FLOOR(rt.composite_end_snapshot_id / 1000000)
            AND FLOOR(rt.composite_start_snapshot_id / 10000000) = FLOOR(rt.composite_end_snapshot_id / 10000000)
            AND FLOOR(dt.driving_start_snapshot_id / 10000000) = FLOOR(rt.composite_start_snapshot_id / 10000000)
            AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
            AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
         UNION ALL
         SELECT /*+ NO_MERGE USE_HASH(dt rt) */
                dt.driving_start_snapshot_id,
                rt.*
           FROM drivingtable dt,
                rangetable rt
          WHERE FLOOR(rt.composite_start_snapshot_id / 10000000) != FLOOR(rt.composite_end_snapshot_id / 10000000)
            AND FLOOR(rt.composite_start_snapshot_id / 20000000) = FLOOR(rt.composite_end_snapshot_id / 20000000)
            AND FLOOR(dt.driving_start_snapshot_id / 20000000) = FLOOR(rt.composite_start_snapshot_id / 20000000)
            AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
            AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
         UNION ALL
         SELECT /*+ NO_MERGE USE_HASH(dt rt) */
                dt.driving_start_snapshot_id,
                rt.*
           FROM drivingtable dt,
                rangetable rt
          WHERE FLOOR(rt.composite_start_snapshot_id / 20000000) != FLOOR(rt.composite_end_snapshot_id / 20000000)
            AND FLOOR(rt.composite_start_snapshot_id / 100000000) = FLOOR(rt.composite_end_snapshot_id / 100000000)
            AND FLOOR(dt.driving_start_snapshot_id / 100000000) = FLOOR(rt.composite_start_snapshot_id / 100000000)
            AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
            AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
         UNION ALL
         SELECT /*+ NO_MERGE USE_MERGE(dt rt) */
                dt.driving_start_snapshot_id,
                rt.*
           FROM drivingtable dt,
                rangetable rt
          WHERE FLOOR(rt.composite_start_snapshot_id / 100000000) != FLOOR(rt.composite_end_snapshot_id / 100000000)
            AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
            AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
      

    In this example, that first block says that if the range is within the same 1-million bucket (FLOOR of value / 1 million):

    FLOOR(rt.composite_start_snapshot_id / 1000000) = FLOOR(rt.composite_end_snapshot_id / 1000000)
    

    then do an equijoin on that bucket value:

    FLOOR(dt.driving_start_snapshot_id / 1000000) = FLOOR(rt.composite_start_snapshot_id / 1000000)
    

    Within that equijoin, refine the results with the actual inequality predicates (>=/< or BETWEEN):

    AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
    AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
    

    But the equijoin does most of the work and is where the performance gain comes in.
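
    Translated to your tables, that first branch might look like this (the 1,000 bucket size is purely illustrative - pick sizes from the actual distribution of to_value - from_value in your data):

        SELECT /*+ USE_HASH(v r) */
               v.value_id, r.descriptor
          FROM Values v,
               Range_Descriptors r
         WHERE FLOOR(r.from_value / 1000) = FLOOR(r.to_value / 1000)   -- range fits one bucket
           AND FLOOR(v.value / 1000) = FLOOR(r.from_value / 1000)      -- equijoin on the bucket
           AND v.value BETWEEN r.from_value AND r.to_value             -- refine within the bucket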

    For rows that span multiple 1-million buckets (the != case), try a wider bucket - like 10 million. If the start/end fall within the same 10-million bucket, equijoin on that bucket value:

              FLOOR(rt.composite_start_snapshot_id / 1000000) != FLOOR(rt.composite_end_snapshot_id / 1000000)
          AND FLOOR(rt.composite_start_snapshot_id / 10000000) = FLOOR(rt.composite_end_snapshot_id / 10000000)
          AND FLOOR(dt.driving_start_snapshot_id / 10000000) = FLOOR(rt.composite_start_snapshot_id / 10000000)
    

    For rows that span multiple 10-million buckets, widen it again... try 20 million, then 100 million, etc. Just ensure that each bucket size is a multiple of the one before it, to avoid the same row being processed in more than one block: with multiples, any range that fits inside a smaller bucket also fits inside every larger one, so the single != test against the previous size is enough to exclude everything already handled. Otherwise, you'd have to repeat the whole list of != conditions from all the preceding blocks to ensure non-duplication of rows.

    Eventually you get to rows whose ranges are so wide that there are diminishing returns in continuing to do this, as each additional block requires another pass of the data. So at some point you cut your losses and add a catch-all block that handles all the rows not handled by any of the above - that's the last one in my example:

                 FLOOR(rt.composite_start_snapshot_id / 100000000) != FLOOR(rt.composite_end_snapshot_id / 100000000)
             AND dt.driving_start_snapshot_id >= rt.composite_start_snapshot_id
             AND dt.driving_start_snapshot_id < rt.composite_end_snapshot_id
    

    That will do the inequality join without any equality predicate to help it - note the USE_MERGE hint there instead of USE_HASH, since a hash join requires an equality condition - but by then hopefully you're dealing with a very small number of rows, so the performance impact is mitigated.

    This approach can be written in SQL in various ways, the above being only one of them. But the overall concept is divide and conquer via discrete steps that find a way to employ an equality predicate for as many of the rows as possible. How many discrete steps and which bucket sizes you choose depend on your data and on where the point of diminishing returns is for making multiple passes. But on many occasions I've found that 3-5 discrete steps give a major performance boost for range joins over doing them all at once.
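
    Putting it together for your tables, a minimal version with two bucket sizes and a catch-all might look like the sketch below. It extends the single-branch example above; the 1,000 and 10,000 bucket sizes are again placeholders (note that 10,000 is a multiple of 1,000), and since your ranges are non-overlapping, each value still matches at most one descriptor:

        -- Branch 1: ranges that fit inside a single 1,000 bucket (most rows)
        SELECT /*+ NO_MERGE USE_HASH(v r) */
               v.value_id, r.descriptor
          FROM Values v,
               Range_Descriptors r
         WHERE FLOOR(r.from_value / 1000) = FLOOR(r.to_value / 1000)
           AND FLOOR(v.value / 1000) = FLOOR(r.from_value / 1000)
           AND v.value BETWEEN r.from_value AND r.to_value
        UNION ALL
        -- Branch 2: ranges that span 1,000 buckets but fit a single 10,000 bucket
        SELECT /*+ NO_MERGE USE_HASH(v r) */
               v.value_id, r.descriptor
          FROM Values v,
               Range_Descriptors r
         WHERE FLOOR(r.from_value / 1000) != FLOOR(r.to_value / 1000)
           AND FLOOR(r.from_value / 10000) = FLOOR(r.to_value / 10000)
           AND FLOOR(v.value / 10000) = FLOOR(r.from_value / 10000)
           AND v.value BETWEEN r.from_value AND r.to_value
        UNION ALL
        -- Catch-all: the few very wide ranges, inequality-only (merge join)
        SELECT /*+ NO_MERGE USE_MERGE(v r) */
               v.value_id, r.descriptor
          FROM Values v,
               Range_Descriptors r
         WHERE FLOOR(r.from_value / 10000) != FLOOR(r.to_value / 10000)
           AND v.value BETWEEN r.from_value AND r.to_value;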