decision-tree, id3, entropy, information-gain

Calculating the entropy of a specific attribute?


This is probably super simple, but I'm learning about decision trees and the ID3 algorithm. I found a website that's very helpful, and I was following everything about entropy and information gain until I got to this point on the page.

I don't understand how the entropy for each individual attribute (sunny, windy, rainy) is calculated -- specifically, how p_i is calculated. It seems different from the way it is calculated for Entropy(S). Can anyone explain the process behind this calculation?


Solution

  • To split a node into two child nodes, one method consists of splitting the node according to the variable that maximises your information gain. When you reach a pure leaf node, the information gain equals 0, because you can't gain any information by splitting a node that contains only one class -- there is nothing left to separate.

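    For instance (a quick illustrative check, not numbers from the linked page), computing the entropy of a pure node and of an evenly mixed two-class node by hand gives:

    import math

    # pure node: a single class with proportion p = 1.0
    h_pure = 0.0 - 1.0 * math.log2(1.0)
    print(h_pure)     # 0.0 -> no uncertainty, so nothing is gained by splitting it

    # evenly mixed node: two classes with proportion p = 0.5 each
    h_mixed = 0.0 - 0.5 * math.log2(0.5) - 0.5 * math.log2(0.5)
    print(h_mixed)    # 1.0 -> one full bit of uncertainty
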
    In your example, Entropy(S) = 1.571 is your current entropy -- the one you have before splitting. Let's call it HBase. Then you compute the entropy of the child nodes for each parameter you could split on. To get your Information Gain, you subtract the weighted entropy of the child nodes from HBase: gain = HBase - (child1NumRows/numOfRows) * entropyChild1 - (child2NumRows/numOfRows) * entropyChild2. The p_i inside each child's entropy is computed the same way as for Entropy(S), just restricted to that child: the fraction of the child's rows that belong to each class. (NbRows and ResultsCounts below are simple helpers -- row count and per-class counts; minimal versions are sketched here so the snippet runs on its own.)

    import math
    from collections import Counter

    # Minimal versions of the helpers used below (not shown in the original answer);
    # they assume the class label is the last element of each row.
    def NbRows(dataSet):
        return len(dataSet)                              # number of rows (samples)

    def ResultsCounts(dataSet):
        return Counter(row[-1] for row in dataSet)       # rows per class label

    def GetEntropy(dataSet):
        results = ResultsCounts(dataSet)
        h = 0.0   # h => entropy

        for i in results.keys():
            p = float(results[i]) / NbRows(dataSet)      # fraction of rows with class i
            h = h - p * math.log2(p)
        return h

    def GetInformationGain(dataSet, currentH, child1, child2):
        p = float(NbRows(child1)) / NbRows(dataSet)      # child1's share of the rows
        gain = currentH - p * GetEntropy(child1) - (1 - p) * GetEntropy(child2)
        return gain
    

    The objective is to pick the split that gives the best Information Gain!
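
    As a quick sanity check, here is a minimal sketch using a made-up toy data set (the class label is the last element of each row, as the helpers above assume). Note that inside GetEntropy(child1) each p is the class fraction within that child only -- this is exactly the p_i the question asks about:

    toyData = [['sunny', 'yes'],
               ['sunny', 'no'],
               ['rainy', 'no'],
               ['rainy', 'no']]

    # split on the weather attribute (first column)
    child1 = [row for row in toyData if row[0] == 'sunny']   # one 'yes', one 'no'
    child2 = [row for row in toyData if row[0] == 'rainy']   # pure: only 'no'

    currentH = GetEntropy(toyData)   # p('yes') = 1/4, p('no') = 3/4 -> ~0.811
    gain = GetInformationGain(toyData, currentH, child1, child2)
    print(gain)   # child1 entropy = 1.0, child2 entropy = 0.0
                  # gain = 0.811 - 0.5 * 1.0 - 0.5 * 0.0 = ~0.311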

    Hope this helps! :)