python · scikit-learn · multi-dimensional-scaling

How to plot an MDS from a similarity matrix?


I'm using a similarity matrix with values between 0 and 1 (1 means that the elements are equal), and I'm trying to plot an MDS with Python and scikit-learn.

I found multiple examples, but I'm not sure what to give as input to mds.fit().

For now, my data looks like this (file.csv):

  ;  A  ;  B  ;  C  ;  D  ; E
A ; 1   ; 0.1 ; 0.2 ; 0.5 ; 0.2
B ; 0.1 ; 1   ; 0.3 ; 1   ; 0
C ; 0.2 ; 0.3 ; 1   ; 0.8 ; 0.6
D ; 0.5 ; 1   ; 0.8 ; 1   ; 0.2
E ; 0.2 ; 0   ; 0.6 ; 0.2 ; 1

I'm currently using this code:

import pandas
from sklearn import manifold
import matplotlib.pyplot as plt

data = pandas.read_table("file.csv", sep=";", header=0, index_col=0)

mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
points = mds.embedding_

# Prepare axes
ax = plt.axes([0,0,2,2])
ax.set_aspect(aspect='equal')

# Plot points
plt.scatter(points[:,0], points[:,1], color='silver', s=150)
# Add labels
for i in range(data.shape[0]):
    ax.annotate(data.index[i], (points[i,0], points[i,1]), color='blue')

#plt.show() # Display the plot on screen instead of saving
plt.savefig('out.png', format='png', bbox_inches='tight') # PNG
#plt.savefig('out.jpg', format='jpg', bbox_inches='tight') # JPG

I'm unsure about what sklearn is doing. I read a lot of examples where people use a "dissimilarity matrix" with 0 on the diagonal (instead of 1).

  1. Should I make a transformation or not? If yes, which transformation should be done? (I read there that a simple subtraction is enough... but other methods exist... I'm a bit lost :( )

  2. Do sklearn and MDS automatically understand the input (as a similarity or dissimilarity matrix, with 0 or 1 on the diagonal)? Or does it expect a distance matrix? (In that case, how do I obtain one from a similarity matrix?)

  3. In this link, they say the similarities are between 1 and -1... I'm using similarities between 0 and 1... I suppose I should transform my data? Which transformation should be used?


Solution

  • I did a comparison with XLSTAT (an Excel extension) in order to try a lot of scenarios and compare the approaches.

    First: my input matrix is a "similarity" matrix, because I can interpret it as "A and A are 100% equal". As MDS takes a dissimilarity matrix as input, I must apply a transformation.

    1. In the literature (Ricco Rakotomalala's French course on data science, pp. 208-209), the easy way is to subtract each cell from the maximum value (a "1 - cell" operation). So you can easily write a Python program (a pandas sketch follows the AWK version below), or (as I keep a trace of each matrix) an AWK pre-processing program:

    similarity-to-dissimilarity-simple.awk

    # We keep the tags around the CSV matrix
    # X ; Word1 ; Word2 ; ...
    # Header
    NR == 1 {
        # First column is just "X" (or space)
        printf("%s", "X");

        # For each column, print the word
        for (i = 2; i <= NF; i++)
        {
            col = $i;
            printf("%s%s", OFS, col);
        }

        # End of line
        printf("\n");
    }

    # Other lines are processed
    # WordN ; 1 ; 0.5 ; 0.2 ; ...
    NR != 1 {
        # First column is the word/tag
        col = $1;
        printf("%s", col);

        # For each column, process the number
        for (i = 2; i <= NF; i++)
        {
            # dissimilarity = (1 - similarity)
            NUM = $i;
            VAL = 1 - NUM;
            printf("%s%s", OFS, VAL);
        }

        printf("\n");
    }
    

    It can be called using this command:

    awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
    
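    If you prefer to stay in Python, here is a minimal pandas equivalent of the same "1 - cell" transformation (a sketch, assuming the same semicolon-separated layout as file.csv above; the file names are placeholders):

    import pandas

    # Read the semicolon-separated similarity matrix (labels in the first row and column)
    sim = pandas.read_csv("input.csv", sep=";", index_col=0)
    # dissimilarity = (1 - similarity), so the diagonal becomes 0
    dissim = 1 - sim
    dissim.to_csv("output-simple.csv", sep=";")
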
    2. A more complex way of computing it (I can't find the reference again, sorry :( ) is based on another transformation of each cell:

    d_ii' = (s_ii + s_i'i' - 2 * s_ii')^(1/2)

    This method seems to be perfectly suited when the diagonal does not contain the same value everywhere (I saw there a co-occurrence matrix... it should apply to his case). In my case, as the diagonal is ALWAYS full of 1, I reduced it to:

    d_ii' = (2 - 2 * s_ii')^(1/2)

    The AWK program to apply this transformation (I implemented the simplified one, because of my data) is therefore:

    similarity-to-dissimilarity-complex.awk

    # Header
    # X ; Word1 ; Word2 ; ...
    NR == 1 {
        # First column is just "X" (or space)
        printf("%s", "X");

        # For each column, print the word
        for (i = 2; i <= NF; i++)
        {
            col = $i;
            printf("%s%s", OFS, col);
        }

        # End of line
        printf("\n");
    }

    # Other lines are processed
    # WordN ; 1 ; 0.5 ; 0.2 ; ...
    NR != 1 {
        # First column is the word
        col = $1;
        printf("%s", col);

        # For each column, process the number
        for (i = 2; i <= NF; i++)
        {
            # dissimilarity = (2 - 2 * similarity)^(1/2)
            NUM = $i;
            VAL = sqrt(2 - 2 * NUM);
            printf("%s%s", OFS, VAL);
        }

        printf("\n");
    }
    

    And you can call it with this command:

    awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv
    
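    The same transformation can also be done directly in Python; a minimal pandas/NumPy sketch (again assuming the diagonal is full of 1, with placeholder file names):

    import numpy
    import pandas

    sim = pandas.read_csv("input.csv", sep=";", index_col=0)
    # dissimilarity = (2 - 2 * similarity)^(1/2), so the diagonal becomes 0
    dissim = numpy.sqrt(2 - 2 * sim)
    dissim.to_csv("output-complex.csv", sep=";")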

    When I used Kruskal's stress to check which version was better... in my case, the simple similarity-to-dissimilarity transformation (1 - cell) was the best (I got a stress between 0.32 and 0.34... which is not good... whereas the complex one gave values above 0.34, which is worse).
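
    For reference, scikit-learn exposes the raw stress of the fitted model as mds.stress_ (the sum of squared residuals between the disparities and the embedded distances). To get a value comparable to Kruskal's stress-1, you can normalise it yourself. A sketch for the metric MDS case, assuming points = mds.embedding_ from the code above and a dissimilarity DataFrame dissim as produced by one of the transformations:

    import numpy
    from sklearn.metrics import euclidean_distances

    # Pairwise distances between the embedded 2-D points
    dist = euclidean_distances(points)
    # Kruskal's stress-1: residuals between the input dissimilarities and
    # the embedded distances, normalised by the embedded distances
    stress1 = numpy.sqrt(((dissim.values - dist) ** 2).sum() / (dist ** 2).sum())
    print(stress1)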