python · scikit-learn · multi-dimensional-scaling

How to plot an MDS from a similarity matrix?


I'm using a similarity matrix with values between 0 and 1 (1 means that the elements are equal), and I'm trying to plot an MDS with Python and scikit-learn.

I found multiple examples, but I'm not sure what to give as input to mds.fit().

For now, my data looks like this (file.csv):

  ;  A  ;  B  ;  C  ;  D  ; E
A ; 1   ; 0.1 ; 0.2 ; 0.5 ; 0.2
B ; 0.1 ; 1   ; 0.3 ; 1   ; 0
C ; 0.2 ; 0.3 ; 1   ; 0.8 ; 0.6
D ; 0.5 ; 1   ; 0.8 ; 1   ; 0.2
E ; 0.2 ; 0   ; 0.6 ; 0.2 ; 1

I'm currently using this code:

import pandas
from sklearn import manifold
import matplotlib.pyplot as plt

data = pandas.read_table("file.csv", sep=";", header=0, index_col=0)

mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
points = mds.embedding_

# Prepare axes
ax = plt.axes([0,0,2,2])
ax.set_aspect(aspect='equal')

# Plot points
plt.scatter(points[:,0], points[:,1], color='silver', s=150)
# Add labels
for i in range(data.shape[0]):
    ax.annotate(data.index[i], (points[i,0], points[i,1]), color='blue')

#plt.show() # Display the plot on screen instead of saving
plt.savefig('out.png', format='png', bbox_inches='tight') # PNG
#plt.savefig('out.jpg', format='jpg', bbox_inches='tight') # JPG

I'm unsure about what sklearn is doing. I read a lot of examples where people use a "dissimilarity matrix" with 0 on the diagonal (instead of 1).

  1. Should I make a transformation or not? If yes, which transformation should be done? (I read there that a simple subtraction is enough... but other methods exist... I'm a bit lost :( )

  2. Do sklearn and MDS automatically understand the input (as a similarity or dissimilarity matrix, with 0 or 1 on the diagonal)? Or does it expect a distance matrix? (In that case, how do I obtain one from a similarity matrix?)

  3. In this link, they say the similarities are between 1 and -1... I'm using similarities between 0 and 1... I suppose I should transform my data? Which transformation should be used?


Solution

  • I did a comparison with XLSTAT (an Excel extension) in order to try a lot of scenarios and compare the approaches.

    First: my input matrix is a "similarity" matrix, because I can interpret it as "A and A are 100% equal". As MDS takes a dissimilarity matrix as input, I must apply a transformation.

    1. In the literature (Ricco Rakotomalala's French course on data science, pp. 208-209), the easy way is to subtract each cell from the maximum value (a "1 - cell" operation). So you can easily write a Python program (a pandas sketch follows the AWK version below), or (as I keep a trace of each matrix) an AWK pre-processing program:

    similarity-to-dissimilarity-simple.awk

    # We keep the tags around the CSV matrix
    # X ; Word1 ; Word2 ; ...
    # Header
    NR == 1 {
        # First column is just "X" (or space)
        printf("%s", "X");

        # For each column, print the word
        for (i = 2; i <= NF; i++)
        {
            col = $i;
            printf("%s%s", OFS, col);
        }

        # End of line
        printf("\n");
    }

    # Other lines are processed
    # WordN ; 1 ; 0.5 ; 0.2 ; ...
    NR != 1 {
        # First column is the word/tag
        col = $1;
        printf("%s", col);

        # For each column, process the number
        for (i = 2; i <= NF; i++)
        {
            # dissimilarity = (1 - similarity)
            NUM = $i;
            VAL = 1 - NUM;
            printf("%s%s", OFS, VAL);
        }

        printf("\n");
    }
    

    It can be called using this command:

    awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
    
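    If you prefer to stay in Python, here is a minimal pandas equivalent of the same "1 - cell" transformation (a sketch, assuming the same semicolon-separated layout as file.csv above; the file names are placeholders):

    import pandas

    # Read the semicolon-separated similarity matrix (labels in the first row and column)
    sim = pandas.read_csv("input.csv", sep=";", index_col=0)
    # dissimilarity = (1 - similarity), so the diagonal becomes 0
    dissim = 1 - sim
    dissim.to_csv("output-simple.csv", sep=";")
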
    2. A more complex way of computing it (I can't find the reference again, sorry :( ) is based on another transformation of each cell:

    d_ii' = (s_ii + s_i'i' - 2 * s_ii')^(1/2)

    This method seems to be perfectly suited when the diagonal does not contain the same value everywhere (I saw there a co-occurrence matrix... it should apply to his case). In my case, as the diagonal is ALWAYS full of 1, I reduced it to:

    d_ii' = (2 - 2 * s_ii')^(1/2)

    The AWK program to apply this transformation (I implemented the simplified one, because of my data) is therefore:

    similarity-to-dissimilarity-complex.awk

    # Header
    # X ; Word1 ; Word2 ; ...
    NR == 1 {
        # First column is just "X" (or space)
        printf("%s", "X");

        # For each column, print the word
        for (i = 2; i <= NF; i++)
        {
            col = $i;
            printf("%s%s", OFS, col);
        }

        # End of line
        printf("\n");
    }

    # Other lines are processed
    # WordN ; 1 ; 0.5 ; 0.2 ; ...
    NR != 1 {
        # First column is the word
        col = $1;
        printf("%s", col);

        # For each column, process the number
        for (i = 2; i <= NF; i++)
        {
            # dissimilarity = (2 - 2 * similarity)^(1/2)
            NUM = $i;
            VAL = sqrt(2 - 2 * NUM);
            printf("%s%s", OFS, VAL);
        }

        printf("\n");
    }
    

    And you can call it with this command:

    awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv
    
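    The same transformation can also be done directly in Python; a minimal pandas/NumPy sketch (again assuming the diagonal is full of 1, with placeholder file names):

    import numpy
    import pandas

    sim = pandas.read_csv("input.csv", sep=";", index_col=0)
    # dissimilarity = (2 - 2 * similarity)^(1/2), so the diagonal becomes 0
    dissim = numpy.sqrt(2 - 2 * sim)
    dissim.to_csv("output-complex.csv", sep=";")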

    When I used Kruskal's stress to check which version was better... in my case, the simple similarity-to-dissimilarity transformation (1 - cell) was the best (I got a stress between 0.32 and 0.34... which is not good... whereas the complex one gave values above 0.34, which is worse).
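
    For reference, scikit-learn exposes the raw stress of the fitted model as mds.stress_ (the sum of squared residuals between the disparities and the embedded distances). To get a value comparable to Kruskal's stress-1, you can normalise it yourself. A sketch for the metric MDS case, assuming points = mds.embedding_ from the code above and a dissimilarity DataFrame dissim as produced by one of the transformations:

    import numpy
    from sklearn.metrics import euclidean_distances

    # Pairwise distances between the embedded 2-D points
    dist = euclidean_distances(points)
    # Kruskal's stress-1: residuals between the input dissimilarities and
    # the embedded distances, normalised by the embedded distances
    stress1 = numpy.sqrt(((dissim.values - dist) ** 2).sum() / (dist ** 2).sum())
    print(stress1)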