When running hierarchical clustering from WEKA (I am calling it through IKVM from C#, but I don't believe that matters; an answer in either language is fine), there is an option to output the dendrogram in Newick format. To parse it, I need to identify the leaves and link each leaf to one datum (vector) in the input.
For example, the input ARFF is:
@RELATION points
@ATTRIBUTE x REAL
@ATTRIBUTE y REAL
@DATA
1.0,2.0
3.0,1.0
1.0,3.0
2.0,1.0
I would get the following dendrogram in Newick format:
((2.0:1,3.0:1):1.49661,(1.0:1,1.0:1):1.49661)
Here it is not clear how the points are identified: the first branch has leaves 2.0 and 3.0, but the second branch has two leaves labeled 1.0, and there is no way to tell which one is which.
Is there a way to change how this output is represented, or to add an extra unique attribute that identifies each datum in the Newick output?
I found a solution. It might not work with all distance functions, but it works with the default configuration of WEKA's hierarchical clustering: add an extra string attribute at the end. It seems to be ignored in all calculations, so it can hold a unique identifier for each row (vector), and WEKA uses it to label the leaves in the final Newick dendrogram.
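If the ARFF file already exists, the extra attribute can be added with a small text transformation. The following is only a sketch in Python (the same logic ports to Java or C#). It assumes a dense ARFF file with one instance per line; the function name `add_id_attribute`, the attribute name `id`, and the choice of the 0-based row index as the identifier are my own assumptions, not anything WEKA requires.

```python
def add_id_attribute(arff_text, attr_name="id"):
    """Append a STRING attribute holding each row's 0-based index.

    Assumes dense ARFF data, one instance per line (no sparse rows).
    """
    out = []
    in_data = False
    row = 0
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            if stripped.lower() == "@data":
                # Declare the new attribute last, just before the data section.
                out.append(f"@ATTRIBUTE {attr_name} STRING")
                in_data = True
            out.append(line)
        elif stripped and not stripped.startswith("%"):
            # Data row: append the row index as the identifier value.
            out.append(f"{stripped},{row}")
            row += 1
        else:
            # Blank line or comment inside the data section: keep as-is.
            out.append(line)
    return "\n".join(out)
```

With identifiers generated this way, each leaf label in the Newick output would be the index of the corresponding row in the original data.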
Example ARFF:
@RELATION points
@ATTRIBUTE x REAL
@ATTRIBUTE y REAL
@ATTRIBUTE id STRING
@DATA
1,5,100
2,6,200
3,5,300
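This results in the following Newick (the output shown comes from a run with a fourth row, labeled 400, which is missing from the listing above):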
(((100:1.41421,200:1.41421):-0.05358,300:1.36064):0.441,400:1.80164)
Without the id attribute, the clusters are exactly the same, but the leaves are labeled with the values of the last remaining attribute (y):
(((5.0:1.41421,6.0:1.41421):-0.05358,5.0:1.36064):0.441,6.0:1.80164)
This is ambiguous, because the labels 5.0 and 6.0 each appear twice.
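To check whether a given Newick string can be mapped back to the input rows, the leaf labels can be extracted and tested for duplicates. This is a minimal sketch in Python, assuming unquoted labels and a branch length after every leaf, as in the outputs shown here; it is not a general Newick parser, and the names `leaf_labels` and `duplicated_labels` are hypothetical helpers.

```python
import re
from collections import Counter

# A leaf label follows "(" or "," and is followed by ":<branch length>".
# Internal nodes follow ")" instead, so they are not matched.
LEAF = re.compile(r"[(,]\s*([^():,;\s]+)\s*:")


def leaf_labels(newick):
    """Return leaf labels in the order they appear in the Newick string."""
    return LEAF.findall(newick)


def duplicated_labels(newick):
    """Return the sorted labels that occur on more than one leaf."""
    counts = Counter(leaf_labels(newick))
    return sorted(label for label, n in counts.items() if n > 1)
```

For the output with the id attribute, `duplicated_labels` returns an empty list, so every leaf maps to exactly one row; for the output without it, the repeated labels are reported.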