machine-learning decision-tree machine-learning-model

Decision tree split implementation

I am doing this as part of a university assignment, but I can't find any resources online on how to implement it correctly. I have read a lot of material on the metrics that define an optimal split (entropy, Gini and others), so I understand how to choose the optimal value of a feature to split the training set into left and right nodes.

However, what I don't get is the complexity of the implementation. We also have to choose the optimal feature, so computing the optimal value at each node takes O(n^2). Real ML datasets have shapes of about 10^2 x 10^6, so this is very expensive in terms of computation cost.

Am I missing some kind of approach that could be used here to help reduce complexity?

I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to make it better:

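    # Start with no split chosen; any real criterion value beats infinity
    feature_idx, threshold, tr_G = None, None, float("inf")
    for f_idx in range(X_subset.shape[1]):
        sorted_values = X_subset.iloc[:, f_idx].sort_values()
        # Skip the min_samples_split - 1 smallest and largest values as candidates;
        # an explicit stop index avoids the empty slice that a stop of -0 would give
        stop = len(sorted_values) - self.min_samples_split + 1
        for v in sorted_values.iloc[self.min_samples_split - 1 : stop]:
            y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
            G = calc_g(y_subset, y_left, y_right)
            # Keep the split with the lowest criterion value seen so far
            if G < tr_G:
                threshold = v
                feature_idx = f_idx
                tr_G = G

    return feature_idx, threshold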

Solution

So, since no one answered, here is some stuff I found out.

First, yes, this task is very computationally intensive. However, several tricks can be used to reduce the number of splits you need to perform to "grow a tree".

This is especially important because you don't really want a giant overfitted tree; it has little value. What matters more is getting a weak model that can be combined with others in some sort of ensembling technique.

As for the regularization tricks, here are a couple I used myself:

limit the maximum depth of the tree
limit the minimum number of items in a node
limit the maximum number of leaves in the tree
limit the minimum change in the split criterion required after performing an optimal split
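
The four limits above can be sketched as a single check that runs before a node is split. This is only a minimal illustration: `StopRules`, `should_stop` and all parameter names and default values are hypothetical, not taken from any library.

```python
from dataclasses import dataclass


@dataclass
class StopRules:
    """Hypothetical container for the four limits listed above."""
    max_depth: int = 5          # maximum depth of the tree
    min_samples_leaf: int = 10  # minimum number of items in a node
    max_leaves: int = 32        # maximum number of leaves in the tree
    min_gain: float = 1e-3      # minimum improvement of the split criterion


def should_stop(rules, depth, n_samples, n_leaves, best_gain):
    """Return True if the current node should become a leaf."""
    if depth >= rules.max_depth:
        return True
    # A split needs at least min_samples_leaf items on each side.
    if n_samples < 2 * rules.min_samples_leaf:
        return True
    # Splitting one leaf replaces it with two, adding one leaf overall.
    if n_leaves + 1 > rules.max_leaves:
        return True
    # The best split found does not improve the criterion enough.
    if best_gain < rules.min_gain:
        return True
    return False
```

Checking these before searching for a split (all but the last, which needs the best gain) also saves the cost of the search itself on nodes that would not be split anyway.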

For the algorithmic part, there is a smarter way to build a tree. If you do it as in the code I posted earlier, the time complexity will be around O(h * N^2 * D), where h is the height of the tree, N is the number of samples and D is the number of features. To work around this, there are several approaches, which I didn't personally code but read about:

Use dynamic programming to accumulate statistics per feature, so you don't have to recalculate them for every split
Use data binning and bucket sort for O(n) sorting
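
Here is a rough sketch of how both ideas might look for a single feature, assuming NumPy arrays, integer class labels 0..n_classes-1 and weighted Gini impurity as the criterion (lower is better). The function names are made up for illustration. The first function sorts once and keeps running class counts, so every candidate threshold is scored without re-splitting the data; the second replaces sorting with a fixed number of equal-width bins, which only gives an approximate threshold.

```python
import numpy as np


def gini_from_counts(counts):
    """Gini impurity for each row of class counts (shape: rows x classes)."""
    totals = counts.sum(axis=1, keepdims=True)
    probs = counts / np.maximum(totals, 1)  # empty rows get probs of 0
    return 1.0 - (probs ** 2).sum(axis=1)


def best_split_sorted(x, y, n_classes, min_samples=1):
    """Exact search: one sort, then one pass. Samples with x <= threshold go left.

    Returns (threshold, weighted_gini), or (None, inf) if no valid split exists.
    """
    order = np.argsort(x, kind="stable")
    x_sorted, y_sorted = x[order], y[order]
    n = len(x_sorted)

    # Running class counts: left_counts[i] describes the first i + 1 sorted samples.
    left_counts = np.cumsum(np.eye(n_classes)[y_sorted], axis=0)
    right_counts = left_counts[-1] - left_counts
    n_left = np.arange(1, n + 1)
    n_right = n - n_left
    score = (n_left * gini_from_counts(left_counts)
             + n_right * gini_from_counts(right_counts)) / n

    # A cut after position i is valid only if the next value differs
    # and both sides keep at least min_samples items.
    valid = np.zeros(n, dtype=bool)
    valid[:-1] = x_sorted[:-1] != x_sorted[1:]
    valid &= (n_left >= min_samples) & (n_right >= min_samples)
    if not valid.any():
        return None, np.inf
    i = np.argmin(np.where(valid, score, np.inf))
    return x_sorted[i], score[i]


def best_split_binned(x, y, n_classes, n_bins=32):
    """Approximate search: histogram the feature, then scan the bin edges."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Bin index in 0..n_bins-1 for every sample, using interior edges only.
    bins = np.digitize(x, edges[1:-1], right=True)
    hist = np.zeros((n_bins, n_classes))
    np.add.at(hist, (bins, y), 1)  # one O(n) pass, no sorting

    left_counts = np.cumsum(hist, axis=0)
    right_counts = left_counts[-1] - left_counts
    n_left = left_counts.sum(axis=1)
    n_right = right_counts.sum(axis=1)
    score = (n_left * gini_from_counts(left_counts)
             + n_right * gini_from_counts(right_counts)) / len(x)

    valid = (n_left > 0) & (n_right > 0)
    if not valid.any():
        return None, np.inf
    b = np.argmin(np.where(valid, score, np.inf))
    return edges[b + 1], score[b]
```

With the sorted version, the cost per feature at a node is dominated by the O(N log N) sort instead of O(N^2); looping over all D features and picking the lowest score gives the feature and threshold for the node.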

Source of info: https://ml-handbook.ru/chapters/decision_tree/intro (use Google Translate, since the website is in Russian)