I'm hoping this is the correct place to post - if not, I am willing to change to SO.
In any case, I am using MDS to help me find a 2-D representation of a dataset. The data are pKa values of amino acid residues across many years' worth of protein data; at its core, it is a table of decimal numbers on the same scale, with many positions (~600 rows) and many years (~12 columns).
My question is this: is the correct input to MDS the data matrix (years vs positions), or can I put in the correlation matrix (year vs year)? I ask because the API docs conflict with the written description.
API docs say data matrix: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html#sklearn.manifold.MDS (i.e. n_samples, n_features).
Written description says "the input similarity matrix": http://scikit-learn.org/stable/modules/manifold.html
If you pass dissimilarity='euclidean' to the initial estimator (or by default), it will take a data matrix and compute the Euclidean distance matrix for you.
If you pass dissimilarity='precomputed', it takes a dissimilarity matrix.
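To make the two modes concrete, here is a minimal sketch. The data is a synthetic stand-in (random numbers, not real pKa values), and turning a correlation matrix into a dissimilarity via 1 - r is just one common choice that I'm assuming here, not something MDS does for you. Also note that the year-vs-year matrix embeds the years (12 points), not the positions. The dissimilarity keyword matches the scikit-learn version discussed here; newer releases may rename or deprecate it, so check the docs for your version.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Synthetic stand-in for the real table: 60 positions (rows) x 12 years (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))

# Mode 1: pass the data matrix; MDS computes Euclidean distances between rows itself.
mds_data = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
emb_positions = mds_data.fit_transform(X)  # one 2-D point per position

# Mode 2: pass a square dissimilarity matrix that you computed yourself.
D = pairwise_distances(X, metric="euclidean")  # 60 x 60
mds_pre = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
emb_positions_pre = mds_pre.fit_transform(D)

# A correlation matrix is a similarity, so it has to be converted first.
# Assumption: use 1 - r as the dissimilarity between two years.
corr = np.corrcoef(X, rowvar=False)  # 12 x 12, year vs year
D_years = 1.0 - corr
D_years = (D_years + D_years.T) / 2.0  # enforce exact symmetry
np.fill_diagonal(D_years, 0.0)  # zero self-dissimilarity
emb_years = MDS(
    n_components=2, dissimilarity="precomputed", random_state=0
).fit_transform(D_years)  # one 2-D point per year
```

So the choice of input decides what gets embedded: rows of the data matrix in the first two cases, and whatever objects index the precomputed matrix in the last one.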
The docs are indeed not super-clear on this, though; I'm sure a pull request adding a brief note to the description of the X argument, and clarifying that 'euclidean' is the default (I had to check the source), would be accepted.