I'm trying to figure out a way to reason about a fairly simple problem:
A table
foo
has two columns,id
with unique identifiers, andvalue
with a randomnumeric
value. Given a constantthreshold
value, find the groups of values where theirvalue
is within that threshold - the count of that group, and the group's average value.
Let's set the stage - here is the foo
table (in PostgreSQL):
CREATE TABLE foo (
id serial PRIMARY KEY,
value numeric NOT NULL
);
INSERT INTO foo(value) VALUES (1),(2),(2),(3),(6),(7),(7),(8);
Which gives the following data:
id | value |
---|---|
1 | 1 |
2 | 2 |
3 | 2 |
4 | 3 |
5 | 6 |
6 | 7 |
7 | 7 |
8 | 8 |
The query should return counts and averages - entries cannot be double-counted, so I'd like the rows to be "cut into groups" based on some threshold "near-ness" value - you can think of the threshold
value as a radius (for my example, a radius of 1.5
would be what I want):
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY foo.value; -- where I'm stuck
The query above will only apply the aggregate functions to rows in which foo.value
has exact duplicates - the following is the result of this query:
AVG(foo.value) | COUNT(foo.id) |
---|---|
1 | 1 |
2 | 2 |
3 | 1 |
6 | 1 |
7 | 2 |
8 | 1 |
what I'd like is something like the following pseudo-SQL:
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY
(real(row_being_grouped.value) <= real(foo.value) + 1.5)
AND
(real(row_being_grouped.value) >= real(foo.value) - 1.5);
Where the result would look like the following:
AVG(foo.value) | COUNT(foo.id) |
---|---|
2 | 4 |
7 | 4 |
I'm not sure if this makes any sense. I am wondering if I can get away with this without doing a sub-query - maybe "bucket" the rows somehow before finding the average within the buckets?
EDIT:
Before writing this question, I was unaware of window functions, and see how critical they are to this question - window functions could be used to process this data in a sub-query.
If "threshold" can be understood as the maximum gap allowed between values of the same group, then it's a well defined gaps-and-islands problem.
Here is a solution:
SELECT grp, count(*) AS grp_count, round(avg(value), 2) AS grp_avg
FROM (
SELECT count(gap) OVER (ORDER BY value) AS grp, *
FROM (
SELECT value
, value - lag(value) OVER (ORDER BY value) > 150 OR null AS gap
FROM foo
) sub1
) sub2
GROUP BY grp
ORDER BY grp;
fiddle (with step-by-step demo)
Explanation and links to more:
For convenience and short code I use Boolean logic (true OR null
→ true
, false OR null
→ null
), and the fact that count()
ignores null
values. See:
If this is more of a raster / cluster / granulation / grid problem, you need to define exactly which rows to pick as "focal points" or, alternatively, the exact nature of the independent raster / grid.