sqlpresto

How to search for a value that matches a condition


I'm looking for the best X% of data in a data set, where "best" is defined as having a minimum sum of values. I can do this by running a bunch of tests to cast around for the result I want:

SELECT
  -- Analyze results manually, looking for a testXX value which is close to X%.
  -- If needed, edit the query for higher precision and try again.
  1.0 * count_if(f1+f2 < 0.1)/count(1) AS test01,
  1.0 * count_if(f1+f2 < 0.2)/count(1) AS test02,
  ...
FROM table1

I tried joining with a SEQUENCE for a little less copy&paste, but besides making the query much more memory-intensive, I couldn't get it to work. Here's what I tried:

SELECT 1.0 * count_if(f1+f2 < threshold)/count(1) AS test
FROM table1
JOIN (SELECT t.v/100.0 AS threshold FROM UNNEST(SEQUENCE(20, 80, 1)) t(v))
  ON true

What I really want is a query which will automatically find a threshold that's equal to X +- some epsilon, or better, a threshold as close as possible to X.

Simplified sample data

f1    f2
0.04  0.05
0.02  0.07
0.02  0.069
0.1   0.1
0.1   0.3
0.1   0.4
0.1   0.5
0.1   0.6
0.1   0.7
0.1   0.8

If my target X is 0.3, I want the threshold to be around 0.09, because 30% of f1+f2 are <=0.09. The real data set has tens of millions of rows with far more random values. If I want a 30% slice, it's okay if it's actually 30.2% or 29.8%.


Solution

  • CREATE TABLE sample (
      f1   DECIMAL(4,3),
      f2   DECIMAL(4,3)
    )
    
    INSERT INTO
      sample
    VALUES
    (0.04,  0.05), 
    (0.02,  0.07), 
    (0.02,  0.069), -- I changed this value
    (0.1 ,  0.1), 
    (0.1 ,  0.3), 
    (0.1 ,  0.4), 
    (0.1 ,  0.5), 
    (0.1 ,  0.6), 
    (0.1 ,  0.7), 
    (0.1 ,  0.8) 
    
    WITH
      ranked AS
    (
      SELECT
        *,
        f1+f2 AS x,
        ROW_NUMBER()
          OVER (ORDER BY f1+f2, f1, f2)
        *
        1.0
        /
        COUNT(*) OVER ()
          AS percentile
      FROM
        sample
    )
    SELECT
      MAX(CASE WHEN percentile <= 0.3 THEN x END),
      MIN(CASE WHEN percentile >  0.3 THEN x END)
    FROM
      ranked
    
    max min
    0.090 0.200

    The 30% cut off can be any value from 0.090 up to (but not including) 0.200

    fiddle