I want each cluster to have a maximum of 20
items. Here is my code in PostgreSQL with PostGIS extension:
WITH RECURSIVE clustered_data AS (-- Step 1: Perform initial clustering
SELECT pma.*
, ST_ClusterKMeans(ST_Transform("centerPoint", 24313), 1, 20000)
OVER (PARTITION BY "areaID" ORDER BY "ID") AS "nta_cluster_id" -- Initial clustering
, 1 AS "recursion_depth" -- Initialize recursion depth
FROM "potential_missed_areas" pma
WHERE pma."campaignid" = -1
AND pma."status" = 'approved'
),cluster_sizes AS (-- Step 2: Calculate cluster sizes
SELECT "nta_cluster_id"
, COUNT(*) AS "cluster_size"
FROM clustered_data
GROUP BY "nta_cluster_id"
),recursive_clusters AS (-- Base case: Start with clusters from previous CTE
SELECT cd.*--column list skipped for brevity
, cd."nta_cluster_id" -- Include nta_cluster_id
, cs."cluster_size" -- Include cluster size from the previous part
, cd."recursion_depth" -- Include recursion_depth from base case
FROM clustered_data cd
JOIN cluster_sizes cs ON cd."nta_cluster_id" = cs."nta_cluster_id"
UNION ALL-- Recursive case: Re-cluster if any cluster has more than 20 items
SELECT rc.* --column list skipped for brevity
, rc."nta_cluster_id" -- Include nta_cluster_id
, cs."cluster_size" -- Include cluster size from the previous part
, rc."recursion_depth" + 1 AS "recursion_depth" -- Increment recursion depth
FROM recursive_clusters rc
JOIN cluster_sizes cs ON rc."nta_cluster_id" = cs."nta_cluster_id"
WHERE cs."cluster_size" > 20 -- Only re-cluster if there are more than 20 items
AND rc."recursion_depth" < 100 -- Limit recursion depth to avoid infinite loops
),dataset as (-- Final selection
SELECT *
, ST_ClusterKMeans(ST_Transform("centerPoint", 24313), GREATEST(1, CEIL("cluster_size" / 20.0)::int))
OVER (PARTITION BY "areaID") AS "new_cluster_id" -- Re-cluster based on size
FROM recursive_clusters
WHERE "recursion_depth" <= 100
)SELECT "areaID"
, new_cluster_id
, count(*)
FROM dataset
GROUP BY 1,2;
I observed that the counts for the majority of the rows are significantly greater than 20
, specifically returning values like 900
, 100
, and 200
.
This discrepancy raises some concerns, as such high counts may indicate potential issues within the recursive logic utilized in the query. I’m particularly interested in delving deeper into whether this recursive approach is functioning as intended or if it might be introducing inaccuracies into the results.
Additionally, I would appreciate insights on alternative methods or strategies that could achieve the same objectives, potentially yielding more reliable and accurate counts.
order by
in a window function call implies range between unbounded preceding
and current row
frame. That means each of the shapes only sees itself and shapes with lower ID. Even if that doesn't cause problems, it needlessly complicates the operation. Remove the order by
from the window definition.recursion_depth
and that's it - you could've done that without recursive
, joining to generate_series(2,99,1)
.ST_Transform()
slows things down and disables index support.My bet is you overlooked that in point 2 above, you multiplied your set by cloning the same shapes in the large clusters, over and over, 98 times, for no obvious reason. It'll come out larger than it came in, and overall count(*)
as well as individual cluster size will go up accordingly.
You can count(*)
your input and compare that to a total count(*)
on your output. In this test I generated 1005 points within the boundaries of SRID 24313 (in and around Pakistan) and your query turned those 1005
points into 95847
.
Here's what I think you tried to do:
demo at db<>fiddle
DROP TABLE IF EXISTS clustered_points;
CREATE TABLE clustered_points AS
WITH clustered_data AS (-- Step 1: Perform initial clustering
SELECT pma.*
, ST_ClusterKMeans(ST_Transform("centerPoint", 24313), 1, 2e4)
OVER (PARTITION BY "areaID") AS "nta_cluster_id" -- Initial clustering
FROM "potential_missed_areas" pma
WHERE pma."campaignid" = -1
AND pma."status" = 'approved'
),cluster_sizes AS (-- Step 2: Calculate cluster sizes
SELECT "areaID"
, "nta_cluster_id"
, COUNT(*)::int AS "cluster_size"
FROM clustered_data
GROUP BY 1,2
),dataset as (-- Final selection
SELECT *
, ST_ClusterKMeans( ST_Transform("centerPoint", 24313)
,GREATEST(1,CEIL("cluster_size" / 20.0))::int)
OVER (PARTITION BY "areaID", "nta_cluster_id") AS "new_cluster_id" -- Re-cluster based on size
FROM clustered_data
JOIN cluster_sizes USING("areaID","nta_cluster_id")
)SELECT *
, format( '"areaID":%s,"nta_cluster_id":%s,"new_cluster_id":%s'
,"areaID","nta_cluster_id","new_cluster_id") as "full_cluster_identifier"
, rank()over(order by "areaID","nta_cluster_id","new_cluster_id") as "final_cluster_id"
FROM dataset;
Also useful for inspection:
drop table if exists cluster_shapes;
create table cluster_shapes as
select min("full_cluster_identifier") as "full_cluster_identifier"
, min("final_cluster_id") as "final_cluster_id"
, min("cluster_size") as "cluster_size"
--, st_optimalalphashape(st_collect("centerPoint")) as geom
, st_convexhull(st_collect("centerPoint")) as geom
from clustered_points
group by "final_cluster_id"
having min("cluster_size")>5;
This doesn't multiply the set and performs your two-phase cluster correctly. What I think you overlooked is that you partitioned by "areaID"
in the initial step, but then you forgot to involve it in the group by
when counting the sizes in cluster_sizes
. This lead to different clusters in different areas being counted together because their numbers are re-used in different areas.
Cluster 1 in area 1 has the same number, but isn't related to cluster 1 in area 5. If you only group by "nta_cluster_id"
, it'll mix them together and the next step will see incorrect sizes.
Here are some clusters formed using that on 200k random points: On a surface area this large with uniformally distributed points, 20km radius is too short to form sizeable clusters, and I really needed a lot on input for even a few to connect.