I'm implementing a decision tree algorithm and trying to use orthogonality to measure the quality of a split. My understanding is that I calculate orthogonality as:

1 − cos θ(P_i, P_j)

where i is the partition of the data before the split and j is the partition after the split; P_i and P_j are the vectors of probabilities for each target value (class) in each partition.
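In other words (and this is my assumption), cos θ here should be the cosine of the angle between the two probability vectors, i.e. the standard cosine similarity:

cos θ(P_i, P_j) = (P_i · P_j) / (‖P_i‖ ‖P_j‖)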
I've implemented the following, but I'm not sure if I'm interpreting this correctly. I've got 6 classes: vector 1 has 66% in class 1, 33% in class 2 and none in the remaining classes, while vectors 2 and 3 have the same distribution (40%, 10%, 10%, 20%, 10%, 10%).
import numpy as np

def calculate_orthogonality(vector_1, vector_2):
    dot_product = np.dot(vector_1, vector_2)
    orthogonality = 1 - np.cos(dot_product)
    return orthogonality

vector1 = [0.6666666666666666, 0.3333333333333333, 0, 0, 0, 0]
vector2 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]
vector3 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]

print(calculate_orthogonality(vector1, vector2))
print(calculate_orthogonality(vector1, vector3))
print(calculate_orthogonality(vector2, vector3))

which prints:

0.0446635108744
0.0446635108744
0.028662025148
In particular, I would have expected the comparison of vector2 and vector3 to return 0, i.e. they're identical and therefore parallel.
This leads me to believe I've misunderstood something here. Any ideas?
p.s. I have looked at other common measures such as Gini impurity and they're fine, but I've come across this as an alternative and I'm trying to measure its effectiveness.
Cheers
David
EDIT:
Having found the following: http://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html
it looks like I was way off in my understanding: np.cos expects an angle in radians, so my original code was computing cos(P_i · P_j) rather than the cosine of the angle between the two vectors. If I use that implementation I get the following:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

vector1 = [0.6666666666666666, 0.3333333333333333, 0, 0, 0, 0]
vector2 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]
vector3 = [0.4, 0.1, 0.1, 0.2, 0.1, 0.1]

print(cos_sim(vector1, vector2))
print(cos_sim(vector1, vector3))
print(cos_sim(vector2, vector3))

which prints:

0.821583836258
0.821583836258
1.0
Vectors 2 and 3 now show up as identical (cosine similarity of 1.0). I need to understand a bit more about the process, but I think this is correct.
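And, assuming I've now got the definition right, the orthogonality from the top of the question is just 1 minus this cosine similarity, so identical distributions score 0 and orthogonal ones score 1. A small helper (my naming), using the cos_sim function and vectors defined above:

def orthogonality(a, b):
    # 1 - cosine similarity: 0 for identical (parallel) distributions,
    # 1 for orthogonal ones.
    return 1 - cos_sim(a, b)

print(orthogonality(vector1, vector2))  # ~0.1784
print(orthogonality(vector2, vector3))  # 0.0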
Sorry for the delay - the answer was indeed to use the cosine similarity code as per the edit above, and then take 1 - cos_sim(a, b) as the orthogonality of the two probability vectors.
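For anyone trying to use this for split quality, here's a rough sketch of how I'm applying it (the function and variable names are just illustrative, and the example distributions are made up):

import numpy as np

def split_orthogonality(p_before, p_after):
    # 1 - cos θ between the class-probability vector of the data before
    # the split (P_i) and that of a partition after the split (P_j).
    cos_theta = np.dot(p_before, p_after) / (np.linalg.norm(p_before) * np.linalg.norm(p_after))
    return 1 - cos_theta

# Hypothetical class distributions over the 6 classes:
parent_probs = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]   # before the split
child_probs  = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0]   # one partition after the split
print(split_orthogonality(parent_probs, child_probs))  # larger value = the split changes the distribution more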