I mused about how hard it would be to create an Excel Lambda function to deliver K-means clusters. I built a one-dimensional version of it with only a small investment of time.
One-dimensional versions are, by themselves, quite useful for solving clustering problems. For example, I used a 1D version to examine journey data for people walking between buildings on a campus (using access control data). The management of this company suspected that when employees walked between buildings while on duty, they were using the journeys to "go shopping", "take long breaks", etc., and I thought I could measure to what extent this really happened. The data quickly exposed interesting details - e.g., don't try to change buildings just after lunch, because the elevators are full. It also exposed "clusters" of transit times:
This confirmed the managers' suspicions and put measurements on them. Unfortunately, it also offered at least one major reason this facility had the lowest productivity in the world.
In another 1D case, I clustered invoicing amounts to build some kind of classifier - more like binning for a histogram.
So, here is my Lambda-based, non-VBA one-dimensional K-means function (defined in Name Manager as KMEANS_1D, since it calls itself recursively):
=LAMBDA( dataArray, nodes, [changeThreshold], [kArray],
LET(
nSeq, SEQUENCE(, nodes),
n, IF(ISOMITTED(kArray), nSeq / (nodes + 1), kArray),
adjust, LAMBDA(array, nArray,
BYCOL(
IF(BYROW(ABS(array - nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) = nSeq, array, ""),
LAMBDA(x, AVERAGE(x))
)
),
next_kArray, adjust(dataArray, n),
IF(
SUM(ABS(next_kArray - n)) <= IF(ISOMITTED(changeThreshold), 0.01, changeThreshold),
next_kArray,
KMEANS_1D( dataArray, nodes, changeThreshold, next_kArray )
)
)
)
Where the arguments are:
dataArray - the data to cluster, as a single column
nodes - the number of clusters (k)
changeThreshold - optional; iteration stops when the total centroid movement falls at or below this value (defaults to 0.01)
kArray - optional; the current centroid estimates, supplied internally on the recursive calls
NOTES: This has no error handling, so user input errors, data containing errors, convergence faults, etc. result in #VALUE!. It only accepts vertically presented data from a single column. If you want it to take tabular, row-wise data, change it; but if I put in all of those manipulations, the algorithm gets lost in the code, and this is only for illustration. Also, I chose not to make a random initial selection of nodes because I feared the function would become volatile. Therefore, I used a simple even distribution (nSeq / (nodes + 1)).
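For illustration, the same logic can be sketched in Python. This is a hedged translation, not the Lambda itself: the function name is mine, and unlike the Excel version it scales the even initial spread to the data's range (the formula's nSeq / (nodes + 1) produces raw fractions).

```python
def kmeans_1d(data, nodes, change_threshold=0.01, k_array=None):
    """1D K-means mirroring the Excel Lambda: even (non-random)
    initialization, then recursive refinement until the total centroid
    movement falls at or below change_threshold."""
    if k_array is None:
        # Even spread; scaled to the data range here (an assumption this
        # sketch adds - the Lambda uses the raw fractions i / (nodes + 1)).
        lo, hi = min(data), max(data)
        k_array = [lo + (hi - lo) * i / (nodes + 1) for i in range(1, nodes + 1)]
    # Assign each point to its nearest centroid ...
    clusters = [[] for _ in range(nodes)]
    for x in data:
        dists = [abs(x - c) for c in k_array]
        clusters[dists.index(min(dists))].append(x)
    # ... and average each cluster to get the next centroid estimates.
    next_k = [sum(c) / len(c) if c else k for c, k in zip(clusters, k_array)]
    if sum(abs(n - k) for n, k in zip(next_k, k_array)) <= change_threshold:
        return next_k
    return kmeans_1d(data, nodes, change_threshold, next_k)
```

Like the Lambda, it has no error handling and will keep recursing on pathological input; an empty cluster keeps its previous centroid rather than erroring.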
Now, the challenge is to make a multi-dimensional version. That has proven not to be so trivial. I will offer one as an answer and ask whether there are improvements or alternatives.
I will not mark a correct answer - this is just for sharing and collaboration.
I'm kinda unfamiliar with K-means and tried to understand it by doing a little research. I used REDUCE to iterate:
=LET(xy, C3:D102,
    k, 3,
    REDUCE(TAKE(xy, k), SEQUENCE(10),
        LAMBDA(r, i,
            LET(t, BYROW(xy,
                    LAMBDA(b,
                        LET(m, BYROW(((r - b)^2),
                                LAMBDA(n, SUM(n)))^0.5,
                            XMATCH(MIN(m), m)))),
                MAP(SEQUENCE(k) + SEQUENCE(, COLUMNS(xy)) * 0,
                    SEQUENCE(, COLUMNS(xy)) + SEQUENCE(k) * 0,
                    LAMBDA(x, y,
                        AVERAGE(CHOOSECOLS(FILTER(xy, t = x), y))))))))
It first takes the first k rows (I figured there's no need for a volatile random choice); this is the starting point. From there on, it computes the distance from each row in your range to those k starting rows, column-wise (differences to the power 2, then summed, then to the power one-half, i.e. the Euclidean distance). Each row is assigned to its nearest k-value, and all rows nearest to a given k-value are averaged to become the new k-values.
I used 10 iterations.
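The REDUCE scheme above (seed with the first k rows, then repeatedly assign rows to the nearest centroid and average each cluster) can be sketched in Python for comparison; the function name and the empty-cluster guard are my additions, not part of the formula.

```python
def kmeans_nd(xy, k, iterations=10):
    """Multi-dimensional K-means following the REDUCE formula:
    seed with the first k rows, then run a fixed number of
    assign-then-average passes (the formula uses SEQUENCE(10))."""
    centroids = [list(row) for row in xy[:k]]  # TAKE(xy, k)
    for _ in range(iterations):                # REDUCE over SEQUENCE(10)
        # t: index of the nearest centroid for each row (Euclidean distance).
        labels = []
        for row in xy:
            dists = [sum((c - v) ** 2 for c, v in zip(cen, row)) ** 0.5
                     for cen in centroids]
            labels.append(dists.index(min(dists)))
        # New centroids: column-wise average of each cluster's members.
        for j in range(k):
            members = [row for row, lab in zip(xy, labels) if lab == j]
            if members:  # the formula would error on an empty cluster
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```

With a fixed iteration count there is no convergence test; for well-separated data, 10 passes is typically plenty.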