This is my first question to the stack overflow.
We are doing statistics for simulation data and I wanted to implement the bootstrap in Oracle-SQL (sampling data with replacement). My approach is as follows:
However, there seems to be a problem with joins. Oracle is losing some rows.
Here is an example. When I do the following query, I get a random ordering of the rows:
WITH basis as (
SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
), group_counts AS (
SELECT KPI_KEY, COUNT(*) as total_count
FROM basis
GROUP BY KPI_KEY
)
, random_numbers AS (
SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
FROM group_counts
GROUP BY KPI_KEY
CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
)
select * from random_numbers
;
Result (correct): KPI_KEY RAND_RN 'A' 1 'A' 2 'A' 2
Now, when I join the numbered rows like the following, the result is wrong:
WITH basis as (
SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
), group_counts AS (
SELECT KPI_KEY, COUNT(*) as total_count
FROM basis
GROUP BY KPI_KEY
)
, numbered_rows AS (
SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER()
OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
FROM basis
JOIN group_counts USING (KPI_KEY)
)
, random_numbers AS (
SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
FROM group_counts
GROUP BY KPI_KEY
CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
)
SELECT rn.KPI_KEY, nr.KPI_VALUE, nr.rn
FROM random_numbers rn
LEFT JOIN numbered_rows nr
ON rn.KPI_KEY = nr.KPI_KEY
and rn.rand_rn = nr.rn
;
Result (value is random, but missing rows from join): KPI_KEY KPI_VALUE RN 'A' 5 3
Someone who has had a similar problem before? Seems like an Oracle bug to me, but maybe I missed an important detail.
I tried the materialized hint, and also the rownum approach found here: Random join in oracle
The issue is that the GROUP BY
of the penultimate query is aggregating all the generated rows into a single row and the CONNECT BY
clause effectively does unnecessary work for zero benefit.
This appears to be because the SQL engine in rewriting the query and if you take the output from the penultimate query, without the ultimate query, then it performs GROUP BY
then CONNECT BY
and if you perform both the penultimate and ultimate queries then it performs CONNECT BY
and then GROUP BY
(in the reverse order).
EXPLAIN PLAN FOR
WITH basis as (
SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
), group_counts AS (
SELECT KPI_KEY, COUNT(*) as total_count
FROM basis
GROUP BY KPI_KEY
)
, numbered_rows AS (
SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER()
OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
FROM basis
JOIN group_counts USING (KPI_KEY)
)
, random_numbers AS (
SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
FROM group_counts
GROUP BY KPI_KEY
CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
)
SELECT *
FROM random_numbers rn
;
Gives the plan:
PLAN_TABLE_OUTPUT
Plan hash value: 448364244
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 48 | 7 (15)| 00:00:01 |
| 1 | VIEW | | 3 | 48 | 7 (15)| 00:00:01 |
|* 2 | CONNECT BY WITHOUT FILTERING| | | | | |
| 3 | VIEW | | 3 | 48 | 7 (15)| 00:00:01 |
| 4 | HASH GROUP BY | | 3 | 9 | 7 (15)| 00:00:01 |
| 5 | VIEW | | 3 | 9 | 6 (0)| 00:00:01 |
| 6 | UNION-ALL | | | | | |
| 7 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 8 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 9 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("KPI_KEY"=PRIOR "KPI_KEY")
filter("TOTAL_COUNT">=LEVEL AND PRIOR "DBMS_RANDOM"."VALUE"() IS NOT
NULL)
And:
EXPLAIN PLAN FOR
WITH basis as (
SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
), group_counts AS (
SELECT KPI_KEY, COUNT(*) as total_count
FROM basis
GROUP BY KPI_KEY
)
, numbered_rows AS (
SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER()
OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
FROM basis
JOIN group_counts USING (KPI_KEY)
)
, random_numbers AS (
SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
FROM group_counts
GROUP BY KPI_KEY
CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
)
SELECT rn.*, nr.*, ROWNUM
FROM random_numbers rn
FULL OUTER JOIN numbered_rows nr
ON rn.KPI_KEY = nr.KPI_KEY
and rn.rand_rn = nr.rn;
Gives the plan:
Plan hash value: 1952896670
-----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 96 | 17 (18)| 00:00:01 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D698B_58974E7 | | | | |
| 3 | UNION-ALL | | | | | |
| 4 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 6 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 7 | LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D698C_58974E7 | | | | |
| 8 | HASH GROUP BY | | 1 | 3 | 3 (34)| 00:00:01 |
| 9 | VIEW | | 3 | 9 | 2 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | SYS_TEMP_0FD9D698B_58974E7 | 3 | 48 | 2 (0)| 00:00:01 |
| 11 | COUNT | | | | | |
| 12 | VIEW | VW_FOJ_0 | 2 | 96 | 8 (25)| 00:00:01 |
|* 13 | HASH JOIN FULL OUTER | | 2 | 96 | 8 (25)| 00:00:01 |
| 14 | VIEW | | 1 | 16 | 3 (34)| 00:00:01 |
| 15 | HASH GROUP BY | | 1 | 16 | 3 (34)| 00:00:01 |
|* 16 | CONNECT BY WITHOUT FILTERING | | | | | |
| 17 | VIEW | | 3 | 48 | 2 (0)| 00:00:01 |
| 18 | TABLE ACCESS FULL | SYS_TEMP_0FD9D698C_58974E7 | 3 | 9 | 2 (0)| 00:00:01 |
| 19 | VIEW | | 1 | 32 | 5 (20)| 00:00:01 |
| 20 | WINDOW SORT | | 1 | 22 | 5 (20)| 00:00:01 |
|* 21 | HASH JOIN | | 1 | 22 | 4 (0)| 00:00:01 |
| 22 | VIEW | | 3 | 18 | 2 (0)| 00:00:01 |
| 23 | TABLE ACCESS FULL | SYS_TEMP_0FD9D698B_58974E7 | 3 | 48 | 2 (0)| 00:00:01 |
| 24 | VIEW | | 3 | 48 | 2 (0)| 00:00:01 |
| 25 | TABLE ACCESS FULL | SYS_TEMP_0FD9D698C_58974E7 | 3 | 9 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
13 - access("RN"."KPI_KEY"="NR"."KPI_KEY" AND "RN"."RAND_RN"="NR"."RN")
16 - access("KPI_KEY"=PRIOR "KPI_KEY")
filter("TOTAL_COUNT">=LEVEL AND PRIOR "DBMS_RANDOM"."VALUE"() IS NOT NULL)
21 - access("BASIS"."KPI_KEY"="GROUP_COUNTS"."KPI_KEY")
It may be possible to fix the query and materialise the output from the penultimate query so that the EXPLAIN PLAN
is consistently generated with the ultimate part of your query; however, this seems to be a fragile solution and you would probably be better to rewrite the query entirely to use something that is (a) simpler and (b) less impacted by the compiler rewriting the query.
Assuming that your logic is:
For each kpi_key
:
kpi_key
(there is a bug in your logic using CEIL(DBMS_RANDOM.VALUE(0, num_rows))
as DBMS_RANDOM
will include the lower-bound and exclude the upper-bound so you can get values starting from 0, which you do not want to include, and rounding up to num_rows
but the probabilities are not even as the upper-bound is excluded; instead you want to use FLOOR
and add 1).Self-join the rows so that each row is joined to its random counter-part.
Then you can rewrite the query using analytic functions:
WITH basis (kpi_key, kpi_value) as (
SELECT 'A', 2 FROM dual union all
SELECT 'A', 5 FROM dual union all
SELECT 'A', 3 FROM dual
),
random_numbers AS (
SELECT kpi_key,
kpi_value,
1 + FLOOR(DBMS_RANDOM.VALUE(0, COUNT(*) OVER (PARTITION BY kpi_key))) AS rand_rn,
ROW_NUMBER() OVER (PARTITION BY kpi_key ORDER BY ROWNUM) AS rn
FROM basis
)
select original.kpi_key AS o_kpi_key,
original.kpi_value AS o_kpi_value,
original.rn AS o_rn,
rand.kpi_key AS r_kpi_key,
rand.kpi_value AS r_kpi_value,
rand.rn AS r_rn
from random_numbers original
INNER JOIN random_numbers rand
ON original.rand_rn = rand.rn;
Note: There is no point randomly ordering the ROW_NUMBER
output as you are comparing it to a random value and you will get the same randomness comparing a static-ordered list to random values within that list as you would when comparing a randomly-ordered list to random values within that list.
Which may randomly output:
O_KPI_KEY | O_KPI_VALUE | O_RN | R_KPI_KEY | R_KPI_VALUE | R_RN |
---|---|---|---|---|---|
A | 5 | 2 | A | 3 | 3 |
A | 2 | 1 | A | 3 | 3 |
A | 3 | 3 | A | 2 | 1 |
Note: You probably just want the last 3 columns corresponding to the random row selection.