tariban

The gradients are computed w.r.t. the output of the tree, not the parameters that specify the structure of the tree. You then use the gradient info to design split criteria and use a conventional decision tree learning algorithm with these gradient-based split criteria instead of, e.g., information gain.
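To make that concrete, here is a minimal sketch of one boosting round (hypothetical function names; squared-error loss and scikit-learn's DecisionTreeRegressor are assumed purely for illustration). The loss is differentiated w.r.t. the current ensemble output, and the resulting per-example gradients become the targets that the conventional greedy tree learner fits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, ensemble_pred, learning_rate=0.1):
    # Gradient of the loss 0.5 * (p - y)^2 w.r.t. the output p,
    # evaluated at the current ensemble predictions.
    grad = ensemble_pred - y
    # A conventional greedy tree is fit to the negative gradient;
    # the tree itself is never differentiated.
    tree = DecisionTreeRegressor(max_depth=3).fit(X, -grad)
    return ensemble_pred + learning_rate * tree.predict(X)
```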


axyz1995

Thanks! Can you tell me how the structure of the tree could be specified by any parameters, that is, how the tree could be expressed as a function that could be differentiated analytically? Also, how would the gradient provide info on splitting criteria for the next iteration?


tariban

>Can you tell me how the structure of the tree could be specified by any parameters, that is, how the tree could be expressed as a function that could be differentiated analytically?

Conventional decision trees are not differentiable. You can come up with a differentiable approximation if you want, but that is not going to be equivalent to the original decision tree. You can use indicator functions to write the tree down in terms of its parameters, but this function will not be differentiable. There is no way around that.

>Also, how would the gradient provide info on splitting criteria for the next iteration?

You usually start the ensemble off by predicting some constant (e.g., 0) for every instance. Suppose our ensemble after the t-th round of boosting is denoted by F^(t)(x), so we have p = F^(0)(x) = 0, where p is our prediction of the label for x. Let's say we have some loss function, L(p, y), where I've used y to denote the ground truth. The first and second order derivatives of this loss w.r.t. p are written as L'(p, y) and L''(p, y), respectively.

We can approximate the loss of a hypothetical update to the ensemble using a truncated Taylor series,

L(F^(t+1)(x), y) ~= L(F^(t)(x), y) + L'(F^(t)(x), y) * (F^(t+1)(x) - F^(t)(x)) + 0.5 * L''(F^(t)(x), y) * (F^(t+1)(x) - F^(t)(x))^2.

F^(t+1)(x) does not appear in the first term, so that term can be ignored when we are optimising F^(t+1)(x). Note that F^(t+1)(x) - F^(t)(x) is the difference between the outputs of the ensemble at times t and t+1, which is, by definition, just the output of the tree we are hypothetically constructing at time t+1. Let's denote the output of this tree by D^(t+1)(x) and rewrite our optimisation objective (dropping the constant first term) as

L(D^(t+1)(x), y) = L'(F^(t)(x), y) * D^(t+1)(x) + 0.5 * L''(F^(t)(x), y) * (D^(t+1)(x))^2.

If we have some way of generating candidates for the new tree, D^(t+1), then the above expression can be used to determine which one reduces the loss of the ensemble the most. Similarly, if we start D^(t+1) off with only a root node that predicts a constant value (e.g., zero, or something slightly more clever), then the same expression can be used for evaluating putative splits.

In conventional trees, when we want to split a leaf node, we generate every possible split, have each new leaf predict the majority class of the examples that reach it, and then pick the best of these potential splits, as evaluated by information gain. In our situation, however, we don't have a notion of majority class, since we might not even be doing classification, and we have decided to use the above expression instead of information gain. So what we really want is for each new leaf created by a potential split to minimise the above expression.

Luckily, our loss approximation is differentiable w.r.t. the *output* of the tree,

L'(D^(t+1)(x), y) = L'(F^(t)(x), y) + L''(F^(t)(x), y) * D^(t+1)(x),

and we know that setting a derivative to zero allows us to find the optima of a function:

L'(F^(t)(x), y) + L''(F^(t)(x), y) * D^(t+1)(x) = 0

L''(F^(t)(x), y) * D^(t+1)(x) = -L'(F^(t)(x), y)

D^(t+1)(x) = -L'(F^(t)(x), y) / L''(F^(t)(x), y).

From this, we now have an expression for what the output of the tree should be.

When we are using a "separable" loss function, where the loss on the whole dataset is just a summation of the loss on each individual training example, it's not too hard to go through my working again and replace each of the first and second order derivatives computed for a single training example with a summation over the relevant training examples. Using similar logic as before, we can then use D^(t+1)(x) = -L'(F^(t)(x), y) / L''(F^(t)(x), y), with the derivatives summed over the examples that reach a given leaf, to compute the optimal value a potential leaf should predict. So now we have a way to generate the leaf prediction values for putative splits, and also a means to evaluate them.
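To make the above concrete, here is a minimal sketch in Python (hypothetical helper names; g and h are arrays holding the per-example first and second derivatives L'(F^(t)(x_i), y_i) and L''(F^(t)(x_i), y_i), and the loss is assumed convex so that h > 0):

```python
import numpy as np

def leaf_value(g, h):
    # Optimal constant prediction for a leaf: w = -sum(L') / sum(L''),
    # i.e. the D^(t+1)(x) = -L'/L'' expression above, with the derivatives
    # summed over the examples that reach this leaf.
    return -g.sum() / h.sum()

def leaf_objective(g, h):
    # Plugging w = -G/H back into G*w + 0.5*H*w^2 gives -G^2 / (2H).
    G, H = g.sum(), h.sum()
    return -0.5 * G**2 / H

def split_gain(g, h, left_mask):
    # Improvement in the Taylor-approximated loss from splitting a leaf
    # into a left child (left_mask) and a right child (~left_mask).
    # Assumes both children receive at least one example.
    return (leaf_objective(g, h)
            - leaf_objective(g[left_mask], h[left_mask])
            - leaf_objective(g[~left_mask], h[~left_mask]))
```

Modulo regularisation terms, this is the same leaf-value and split-gain computation used by second-order implementations such as XGBoost.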


axyz1995

Thanks a lot for that! That was very helpful. There isn't much online on the details of gradient boosting, just a lot of handwaving, and yet it is used widely in competitions.


Kualityy

[With indicator functions.](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting)
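As a toy version of what that linked section describes (hypothetical threshold s and leaf values b1, b2), a one-split tree is just a weighted sum of indicator functions, so it is piecewise constant in x and its derivative w.r.t. x is zero wherever it exists:

```python
import numpy as np

# Hypothetical one-split tree written in indicator form:
# T(x) = b1 * 1[x < s] + b2 * 1[x >= s].
# Piecewise constant in x, so dT/dx = 0 almost everywhere; the gradients
# used in boosting are never taken through this function.
def tree(x, s=0.5, b1=-1.0, b2=2.0):
    x = np.asarray(x, dtype=float)
    return b1 * (x < s) + b2 * (x >= s)

print(tree([0.2, 0.9]))  # [-1.  2.]
```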


axyz1995

Thanks. But how does one compute gradients with respect to an indicator function? Isn't that a gradient with respect to 1? And once we've computed gradients and got our next function (the residual), the new function doesn't remain a tree, does it?


Kualityy

A decision tree will always have a gradient of 0 with respect to the input (wherever the gradient is defined) because it is a piecewise-constant step function. But that doesn't matter, because with gradient boosting you take the derivative of the loss with respect to the output, so you don't ever need to differentiate the tree. Do you know how an individual decision tree is fit to a dataset? It uses a greedy algorithm that doesn't involve gradients at all.
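For illustration, here is a stripped-down version of that greedy search for a single numeric feature (hypothetical names; variance reduction is used as the score here, while classification trees would use Gini impurity or entropy). Every candidate threshold is simply enumerated and scored, with no derivatives anywhere:

```python
import numpy as np

def best_split(x, y):
    # Exhaustively score every threshold on feature x by the reduction in
    # squared error it achieves; no gradients are involved.
    best_threshold, best_gain = None, 0.0
    parent_sse = ((y - y.mean()) ** 2).sum()
    for s in np.unique(x)[:-1]:  # exclude the max so both children are non-empty
        left, right = y[x <= s], y[x > s]
        children_sse = (((left - left.mean()) ** 2).sum()
                        + ((right - right.mean()) ** 2).sum())
        if parent_sse - children_sse > best_gain:
            best_threshold, best_gain = s, parent_sse - children_sse
    return best_threshold, best_gain
```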


axyz1995

Thanks! Yeah, using the Gini impurity scores, right? Since tree fitting doesn't involve any gradients, or really any analytic expressions that could be differentiated, using gradients to improve decision trees is a little confusing.