Tags: python, grouping, shapes, analysis, similarity

Grouping landmark vectors by similarity (Python or R) - which is the simplest solution?


Pardon my verbosity, but I think I have much to explain, and my English leaves something to be desired. I'm 66 years old, from Italy, with some experience in programming, but I have never ventured into the realm of digital morphology.

I find tons of documentation and examples of complex and - in my case - overkill libraries/packages that address the subject of shape classifiers. Since the early 2000s I have been occasionally reading, and only partially understanding!, works such as those by Norman MacLeod about eigenshapes (the famous PalaeoMath 101 series of lectures). My most recent reading is Salili-James et al. (2022).

My objective is not particularly ambitious. Out of personal interest, I would like to understand what the simplest kind of algorithm is that could meaningfully group the outlines of the seashells of closely related and very similar species.

Right now, I have just gained the ability to start from a photo of the shell and generate 100-element 1D arrays representing as many Euclidean distances from the centroid to evenly spaced points on the outer edge contour. In the process, all the images are resized to the same size, so that the detection of equal shapes is not hampered by differences in specimen size.

[Figure: shell outline with 100 evenly spaced points]
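For concreteness, here is a minimal sketch of how such a signature could be computed, assuming the outline is already available as an (N, 2) array of contour points (for example from OpenCV's cv2.findContours); the function name and parameters are illustrative:

```python
import numpy as np

def centroid_distance_signature(contour, n_points=100):
    """Resample a closed contour to n_points evenly spaced points and
    return each point's Euclidean distance from the centroid."""
    pts = np.asarray(contour, dtype=float)
    closed = np.vstack([pts, pts[:1]])                 # close the loop
    # Cumulative arc length along the contour.
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    arclen = np.concatenate([[0.0], np.cumsum(seg)])
    # Evenly spaced positions along the total perimeter.
    targets = np.linspace(0.0, arclen[-1], n_points, endpoint=False)
    x = np.interp(targets, arclen, closed[:, 0])       # resample x
    y = np.interp(targets, arclen, closed[:, 1])       # resample y
    resampled = np.column_stack([x, y])
    centroid = resampled.mean(axis=0)
    return np.linalg.norm(resampled - centroid, axis=1)
```

Dividing the resulting vector by its mean is an alternative way to remove size differences, equivalent in purpose to resizing all the images beforehand.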

The 100-element 1D arrays can be plotted as curves, which can be considered a faithful representation of the shape. Please be patient and check the attached pictures, which should be self-explanatory (I'm not actually sure that my language is appropriate).

[Figure: vector with 100 distances from the centroid - a curve that summarizes the typical shell shape]

Before beginning my systematic work of photographing the seashells and saving the 100-value vectors, I want to decide what comes next. Again, I need some embryonic form of unsupervised classifier that, in my hopes, could amount to something very basic: the least-effort solution that could result in some form of "grouping by similarity" of my curves (you can see that I'm less than rigorous in my explanations...).

So, I need two separate pieces of advice:

  1. Comparing... what? Should I compare each curve with the 99 other curves (and the number of results would explode!), or should I create a second term of comparison? In other words, if one term of comparison is provided by each curve in my set, should I average all the curves to build a "mean seashell", then use it as the second term of comparison against all my curves?

  2. Comparing... how? I can work in Python as well as in the RStudio environment. All the libraries/packages that I find seem like overkill to me. Until I find a better idea, I would start with something elementary (e.g., the sum of squares of index-by-index differences along the full length of the curves). What is the simplest programmatic solution to generate "similarity values" that could be used to separate the curves into discrete groups by similarity? (A sketch of both ideas follows this list.)
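As a concrete reference point, here is a minimal sketch of both ideas, assuming the 100-value vectors have been stacked into an (n_shells, 100) NumPy array (the file name and the group count below are illustrative). Note that the proposed sum of squares is exactly the squared Euclidean distance between two curves:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

signatures = np.load("signatures.npy")            # (n_shells, 100), hypothetical file

# Option from point 1: a single "mean seashell" as the second term.
mean_curve = signatures.mean(axis=0)
dist_to_mean = ((signatures - mean_curve) ** 2).sum(axis=1)

# All-pairs option: sum of squared index-by-index differences between
# every pair of curves (a symmetric n_shells x n_shells matrix).
diff = signatures[:, None, :] - signatures[None, :, :]
pairwise = (diff ** 2).sum(axis=2)

# From similarity values to discrete groups: hierarchical clustering.
# linkage() computes the Euclidean distances on the raw vectors itself.
Z = linkage(signatures, method="ward")
groups = fcluster(Z, t=5, criterion="maxclust")   # ask for e.g. 5 groups
print(groups)                                     # one group label per shell
```

The hierarchical-clustering step is one convenient way to turn a table of similarity values into discrete groups; the linkage tree it builds can also be inspected at several levels of similarity before committing to a number of groups.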

Many thanks for your patience and all the best,

Cesare


Solution

  • Your English is fine, as far as I'm concerned.

    That is an interesting problem. I don't necessarily have a definite answer, but I do have a few ideas.

    What you want to compare depends on the classification algorithm you choose, which in itself is dependent on your data. For instance: k-means compares each curve against a per-group average (a "mean seashell" for every cluster, close to your idea in point 1), while hierarchical clustering compares every curve against every other curve through a matrix of pairwise distances.

    Many algorithms are implemented in the scikit-learn package, if that is something that you would like to try. I would at least encourage you to look at the documentation, since quite a few algorithms are presented there with examples.
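    As a hedged sketch of that route (the cluster counts and file name are illustrative, not recommendations), both algorithms mentioned above take only a few lines in scikit-learn:

    ```python
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    signatures = np.load("signatures.npy")   # (n_shells, 100), hypothetical file

    # K-means: each curve is compared to a per-cluster mean curve.
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
    kmeans_labels = kmeans.fit_predict(signatures)

    # Agglomerative clustering: curves are compared pairwise and the
    # closest pairs/groups are merged step by step.
    agglo = AgglomerativeClustering(n_clusters=5, linkage="ward")
    agglo_labels = agglo.fit_predict(signatures)
    ```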

    Good luck on your seashell classification journey!