sqlpostgresqlgroup-byaggregate-functionsbucket

Inexact GROUP BY aggregation


I'm trying to figure out a way to reason about a fairly simple problem:

A table foo has two columns, id with unique identifiers, and value with a random numeric value. Given a constant threshold value, find the groups of values where their value is within that threshold - the count of that group, and the group's average value.

Let's set the stage - here is the foo table (in PostgreSQL):

CREATE TABLE foo (
    id serial PRIMARY KEY,
    value numeric NOT NULL
);
INSERT INTO foo(value) VALUES (1),(2),(2),(3),(6),(7),(7),(8);

Which gives the following data:

id value
1 1
2 2
3 2
4 3
5 6
6 7
7 7
8 8

The query should return counts and averages - entries cannot be double-counted, so I'd like the rows to be "cut into groups" based on some threshold "near-ness" value - you can think of the threshold value as a radius (for my example, a radius of 1.5 would be what I want):

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY foo.value; -- where I'm stuck

The query above will only apply the aggregate functions to rows in which foo.value has exact duplicates - the following is the result of this query:

AVG(foo.value) COUNT(foo.id)
1 1
2 2
3 1
6 1
7 2
8 1

what I'd like is something like the following pseudo-SQL:

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY
   (real(row_being_grouped.value) <= real(foo.value) + 1.5)
   AND
   (real(row_being_grouped.value) >= real(foo.value) - 1.5);

Where the result would look like the following:

AVG(foo.value) COUNT(foo.id)
2 4
7 4

I'm not sure if this makes any sense. I am wondering if I can get away with this without doing a sub-query - maybe "bucket" the rows somehow before finding the average within the buckets?

EDIT:

Before writing this question, I was unaware of window functions, and see how critical they are to this question - window functions could be used to process this data in a sub-query.


Solution

  • If "threshold" can be understood as the maximum gap allowed between values of the same group, then it's a well defined problem.
    Here is a solution:

    SELECT grp, count(*) AS grp_count, round(avg(value), 2) AS grp_avg
    FROM  (
       SELECT count(gap) OVER (ORDER BY value) AS grp, *
       FROM  (
          SELECT value
               , value - lag(value) OVER (ORDER BY value) > 150 OR null AS gap
          FROM   foo
          ) sub1
       ) sub2
    GROUP  BY grp
    ORDER  BY grp;
    

    fiddle (with step-by-step demo)

    Explanation and links to more:

    For convenience and short code I use Boolean logic (true OR nulltrue, false OR nullnull), and the fact that count() ignores null values. See:

    If this is more of a raster / cluster / granulation / grid problem, you need to define exactly which rows to pick as "focal points" or, alternatively, the exact nature of the independent raster / grid.