I'm using postgresql in DBT, so either a postgresql or a DBT solution is viable.
I have a table with fundamentally 4 columns, target, question, value, current
I can get the percentile (cume_dist
) of the value with:
(cume_dist() over (partition by question order by value)) * 100 as percentile
However, targets where the current is zero aren't in the comparison. I need the percentile of them, but they aren't in the "dataset" (they are historical data from the same targets, so they aren't current, but I need the percentile as if they were current)
However, I effectively need the percentile of every value, within the subset of values where current
is 1.
ETA:
I see in the documentation that cume_dist
can take args, but I haven't found any documentation on what those args represent or how to use them...
Re-implement cume_dist()
As far as I know, there is no way to do this with the existing cume_dist()
function, but it's pretty easy to implement the same functionality without cume_dist()
.
The postgres docs say cume_dist()
Returns the cumulative distribution, that is (number of partition rows preceding or peers with current row) / (total partition rows). The value thus ranges from 1/N to 1.
Here is how I set up my demo data:
CREATE TABLE my_table (
target VARCHAR,
question VARCHAR,
value INT,
current BOOL
);
INSERT INTO my_table VALUES
('alice', 'number of hats', 7, true),
('alice', 'number of hats', 6, false),
('alice', 'number of hats', 3, false),
('bob', 'number of hats', 4, true),
('charles', 'number of hats', 9, true),
('david', 'number of hats', 0, true),
('erin', 'number of hats', 4, true),
('alice', 'number of fingers', 10, true),
('bob', 'number of fingers', 10, true),
('charles', 'number of fingers', 9, true),
('charles', 'number of fingers', 10, false),
('david', 'number of fingers', 10, true),
('erin', 'number of fingers', 10, true);
and here is how to do that same thing (rows preceding) / (total rows) without cume_dist()
and with some added logic to exclude rows where that aren't current:
SELECT
target,
question,
value,
current,
(
-- this is the count of partition rows preceding this row
SELECT COUNT(1)
FROM my_table AS inner_table
WHERE
inner_table.question = outer_table.question AND
current = true AND
inner_table.value <= outer_table.value
)::real / (
-- this is the total number of rows for this question
SELECT COUNT(1)
FROM my_table AS inner_table
WHERE
inner_table.question = outer_table.question AND
current = true
)::real * 100.0 AS percentile
FROM my_table AS outer_table;
Here is a fiddle link with this query