I'm looking for the best X% of rows in a data set, where "best" means having the smallest sum f1+f2. I can find it by running a batch of trial thresholds and manually narrowing in on the result I want:
SELECT
-- Analyze results manually, looking for a testXX value which is close to X%.
-- If needed, edit the query for higher precision and try again.
1.0 * count_if(f1+f2 < 0.1)/count(1) AS test01,
1.0 * count_if(f1+f2 < 0.2)/count(1) AS test02,
...
FROM table1
I tried joining against a SEQUENCE to cut down on the copy-and-paste, but besides making the query much more memory-intensive, I couldn't get it to work. Here's what I tried:
SELECT 1.0 * count_if(f1+f2 < threshold)/count(1) AS test
FROM table1
JOIN (SELECT t.v/100.0 AS threshold FROM UNNEST(SEQUENCE(20, 80, 1)) t(v))
ON true
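For what it's worth, a variant of that join does run once each generated threshold gets its own group — a sketch assuming Presto/Trino syntax (which `count_if` and `SEQUENCE` suggest), not a tested fix:

```sql
-- Evaluate every candidate threshold in one pass, then keep the one
-- whose fraction of matching rows lands closest to the 30% target.
SELECT
    t.v / 100.0 AS threshold,
    1.0 * count_if(f1 + f2 < t.v / 100.0) / count(*) AS fraction
FROM table1
CROSS JOIN UNNEST(SEQUENCE(20, 80)) AS t(v)
GROUP BY t.v
ORDER BY abs(1.0 * count_if(f1 + f2 < t.v / 100.0) / count(*) - 0.3)
LIMIT 1
```

The `GROUP BY t.v` is the piece the original attempt was missing: without it, `threshold` is neither aggregated nor grouped, so the query is rejected.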
What I really want is a query that automatically finds a threshold within some epsilon of X, or better yet, a threshold as close to X as possible.
Simplified sample data
f1 | f2 |
---|---|
0.04 | 0.05 |
0.02 | 0.07 |
0.02 | 0.069 |
0.1 | 0.1 |
0.1 | 0.3 |
0.1 | 0.4 |
0.1 | 0.5 |
0.1 | 0.6 |
0.1 | 0.7 |
0.1 | 0.8 |
If my target X is 0.3, I want the threshold to be around 0.09, because 30% of the f1+f2 sums are <= 0.09. The real data set has tens of millions of rows with far more varied values. If I want a 30% slice, it's okay if it's actually 30.2% or 29.8%.
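As an aside, if the engine is Presto/Trino (which `count_if` and `UNNEST(SEQUENCE(...))` suggest), its `approx_percentile` aggregate may give such a threshold directly, with no search loop — an approximate answer, which matches the "29.8% is fine" tolerance:

```sql
-- Approximate 30th percentile of f1+f2: any cut-off at or near this
-- value yields roughly the best 30% slice.
SELECT approx_percentile(f1 + f2, 0.3) AS threshold
FROM table1
```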
CREATE TABLE sample (
    f1 DECIMAL(4,3),
    f2 DECIMAL(4,3)
);
INSERT INTO
sample
VALUES
(0.04, 0.05),
(0.02, 0.07),
(0.02, 0.069), -- I changed this value
(0.1 , 0.1),
(0.1 , 0.3),
(0.1 , 0.4),
(0.1 , 0.5),
(0.1 , 0.6),
(0.1 , 0.7),
    (0.1 , 0.8);
WITH ranked AS (
    SELECT
        *,
        f1 + f2 AS x,
        ROW_NUMBER() OVER (ORDER BY f1 + f2, f1, f2)
            * 1.0 / COUNT(*) OVER () AS percentile
    FROM sample
)
SELECT
    MAX(CASE WHEN percentile <= 0.3 THEN x END),
    MIN(CASE WHEN percentile > 0.3 THEN x END)
FROM ranked
max | min |
---|---|
0.090 | 0.200 |
The 30% cut-off can be any value from 0.090 up to (but not including) 0.200.
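To actually take the slice, the computed cut-off can be fed back into a filter — a sketch built on the same `ranked` CTE from above:

```sql
-- Compute the cut-off once, then select every row at or below it.
WITH ranked AS (
    SELECT
        *,
        f1 + f2 AS x,
        ROW_NUMBER() OVER (ORDER BY f1 + f2, f1, f2)
            * 1.0 / COUNT(*) OVER () AS percentile
    FROM sample
),
cutoff AS (
    SELECT MAX(CASE WHEN percentile <= 0.3 THEN x END) AS threshold
    FROM ranked
)
SELECT s.*
FROM sample s
CROSS JOIN cutoff c
WHERE s.f1 + s.f2 <= c.threshold
```

On the sample data this keeps the three rows whose sums are at most 0.090, i.e. the best 30%.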