Tags: python, vector, decision-tree, orthogonal

Orthogonality between distributions


I'm implementing a decision tree algorithm and trying to use orthogonality to measure the quality of a split. My understanding is that orthogonality is calculated as:

orthogonality(P_i, P_j) = 1 − cos θ(P_i, P_j)

where θ is the angle between P_i and P_j, i is the partition of data before the split, j is the partition after the split, and P_i and P_j are vectors of probabilities for each target value in each partition.

I've implemented the following, but I'm not sure I'm interpreting this correctly. I've got 6 classes: vector 1 has 66% in class 1, 33% in class 2, and none in the remaining classes. Vectors 2 and 3 have the same distribution (40%, 10%, 10%, 20%, 10%, 10%).

import numpy as np

def calculate_orthogonality(vector_1, vector_2):
    dot_product = np.dot(vector_1, vector_2)
    orthogonality = 1 - np.cos(dot_product)
    return orthogonality

vector1 = [0.6666666666666666, 0.3333333333333333, 0, 0, 0, 0]
vector2 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]
vector3 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]

print(calculate_orthogonality(vector1,vector2))
print(calculate_orthogonality(vector1,vector3))
print(calculate_orthogonality(vector2,vector3))

0.0446635108744
0.0446635108744
0.028662025148

In particular, I would have expected vectors 2 and 3 to return 0, i.e. they're identical and therefore parallel.

This leads me to believe I've misunderstood something here. Any ideas?
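
For what it's worth, tracing the vector2/vector3 case by hand reproduces the third result, so the function really is returning 1 - cos(0.24) here:

import numpy as np

# vector2 and vector3 are identical, so their dot product is a self-dot-product
v = np.array([0.4, 0.1, 0.1, 0.2, 0.1, 0.1])
dot = np.dot(v, v)        # 0.24
print(1 - np.cos(dot))    # ~0.028662, matching the third output above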

P.S. I have looked at other common measures such as Gini impurity and they're fine, but I've come across this one as an alternative and I'm trying to measure its effectiveness.

Cheers

David

EDIT:

Having found the following: http://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html

it looks like I was way off in my understanding. If I use this implementation, I get the following:

import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

vector1 = [0.6666666666666666, 0.3333333333333333, 0, 0, 0, 0]
vector2 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]
vector3 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]

print(cos_sim(vector1,vector2))
print(cos_sim(vector1,vector3))
print(cos_sim(vector2,vector3))

0.821583836258
0.821583836258
1.0

Vectors 2 and 3 are now correctly identified as identical (cosine similarity of 1.0). I need to understand a bit more about the process, but I think this is correct: my original code passed the dot product straight to np.cos, treating 0.24 as an angle in radians, instead of computing the cosine of the angle between the vectors, which is the dot product divided by the product of the norms.
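
Wrapping this into the orthogonality measure I was after now behaves the way I originally expected, i.e. identical distributions score 0 (a quick check on the same vectors as above):

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vector1 = [0.6666666666666666, 0.3333333333333333, 0, 0, 0, 0]
vector2 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]

print(1 - cos_sim(vector1, vector2))   # 0.178416163742
print(1 - cos_sim(vector2, vector2))   # 0.0 - identical, hence parallel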


Solution

  • Sorry for the delay - the answer was indeed to use the cos_sim implementation from the edit above.

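
    To plug this back into the split measure from the question, here is a minimal sketch; the parent/left/right distributions below are made-up examples rather than anything from the post. Each child partition is scored by how far its class distribution diverges from the parent's, with 0 meaning identical and values approaching 1 meaning orthogonal:

    import numpy as np

    def cos_sim(a, b):
        """Cosine similarity between two vectors."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def orthogonality(p_parent, p_child):
        """1 - cosine similarity: 0 for identical class distributions."""
        return 1 - cos_sim(p_parent, p_child)

    # Hypothetical split: parent distribution from the question, two pure children
    parent = [0.6666666666666666, 0.3333333333333333, 0, 0, 0, 0]
    left = [1.0, 0.0, 0, 0, 0, 0]
    right = [0.0, 1.0, 0, 0, 0, 0]

    print(orthogonality(parent, left))    # ~0.1056
    print(orthogonality(parent, right))   # ~0.5528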