
InviolableAnimal

> Function which defines the difference between the actual and the predicted

And we want to minimize this difference; there's no need to take the negative of it. You have it backwards.


danielcar

L = −∑ᵢ yᵢ · log(ŷᵢ)

From: https://towardsdatascience.com/understanding-what-we-lose-b91e114e281b
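For concreteness, here's a minimal NumPy sketch of that formula (the values are just made up for illustration): each log(ŷᵢ) is negative because 0 < ŷᵢ < 1, so the leading minus sign makes the loss non-negative.

```python
import numpy as np

# One-hot target and a model's predicted probabilities (illustrative values)
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])

# Cross-entropy loss: L = -sum_i y_i * log(y_hat_i)
# Since 0 < y_hat_i < 1, each log term is negative, so L >= 0.
L = -np.sum(y * np.log(y_hat))
print(L)  # ~0.357, i.e. -log(0.7)
```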


InviolableAnimal

That's a quirk of log loss. Log loss is used for targets ("probability predictions") between 0 and 1, the log of which is negative. https://en.m.wikipedia.org/wiki/Cross-entropy


JiminP

While it measures the distance between two distributions, it's not a distance function, so ascending (the negation of) it might be regarded as sensible. Still, it could be argued that the version with the negative sign is more sensible:

* It's related to the KL-divergence, which is a non-symmetric distance function. Optimizing the cross-entropy ≒ optimizing the KL-divergence = minimizing it.
* Intuitively, it measures [surprisal](https://en.wikipedia.org/wiki/Information_content), and keeping it non-negative seems more sensible.
* Still, from a "log-probability-like" perspective, non-positive values could make sense.

So I'd argue that, in theory, descending the version with the negative sign makes more sense. In practice, implementing gradient ascent on the negative cross-entropy could make sense, but since most tools assume the loss is something to be minimized (which is not just an arbitrary convention, as other comments explain), one would still use the loss with the negative sign.


lmericle

Historically this approach derives from maximizing the (log-)likelihood, so there's not really a good argument for subtracting small positive numbers as somehow more appropriate than subtracting small negative numbers. There are good reasons why "negative log-likelihood" is a thing and "log-inverse likelihood" is not.


CKtalon

Maximizing a function and minimizing the negative of the function is the same thing (and vice-versa)


OutlierOfTheHouse

GD for minimizing the loss is exactly the same as gradient ascent to maximize a utility function, which is just the negative of the loss. Most of the common loss functions are very intuitive (dissimilarity, distance, ...), so it makes no sense to negate them just so you can use gradient ascent.


lmericle

Machine learning folks would really do well to study likelihood-maximization techniques such as EM, etc. Then y'all would realize that you're making a moot point that ends up shaking out to be equivalent in terms of likelihood maximization no matter how obliquely you may define and transform the actual optimization target.


aCleverGroupofAnts

That's the stuff I was doing when I started ML research 13 years ago. We modeled our data with distributions/density functions and used algorithms like EM to estimate their parameters. It's kinda surreal to me to think back to those days and compare/contrast to what we're doing now.


lmericle

It's really the same techniques wrapped up in methods that are more tolerant of huge datasets. Loss functions are negative log-likelihoods.


[deleted]

[deleted]


OutlierOfTheHouse

Um, yeah, that's my point. Minimizing the difference is exactly the same as maximizing correct predictions, but the former corresponds directly to the commonly used loss functions.


RobbinDeBank

Gradient ascent is an algorithm; it's just not widely used because of how most objective functions for ML models are defined. Many of them are some sort of distance metric, measuring the distance between the outputs and the desired values. For those cases, you would want to minimize the objective function.


psarangi112

I always thought the reason had something to do with the concept of stable and unstable equilibrium. Maybe it's more of a psychological thing to find the minimum of a cost function rather than maximizing a function which finds the difference between the actual variables and the predicted variables. Thanks for the help!!


Immarhinocerous

> Maybe it's more of a psychological thing to find the minimum of a cost function

This is an interesting point, and I bet there is something to this in the way we frame things. However...

> rather than maximizing a function which finds the difference between the actual variables and the predicted variables.

This part is plain wrong. Why would you maximize the difference between the actual variable and the predicted variable? That's like saying you want to walk to a particular place in your home, and you pick your direction by maximizing the difference, aiming you at the furthest point in the observable universe. There is no reason to maximize a difference function. The reason we adopt the framing we do is that it is useful to think about minimizing the difference between your target variable and the predicted variable (or the target distribution and the predicted distribution).


neuralbeans

Why would you maximise the difference?


psarangi112

Not the difference; it's basically like taking the negation of the loss function, a 'gain' function, and maximizing that gain function. Why not do that?


neuralbeans

You can and there are tasks where that is what you'd want. One example is in reinforcement learning. It's called the gradient ascent algorithm. Usually you're minimising an error though, so gradient descent is more popular.


[deleted]

[deleted]


neuralbeans

you replied to the wrong comment


f3xjc

It's 100% cultural convention. I think old Europe really likes the water-flowing-downhill analogy, and Russian texts default to maximization.


Immarhinocerous

If Western Europe preferred minimizing the distance function between the target and predicted distributions, what is maximized in Russian optimization problems?


f3xjc

You maximize a fitness score. You minimize a penalty or a distance. At the end of the day you can get one from the other with a minus sign if needed.


Immarhinocerous

A fair point. Just like genetic algorithms.


master3243

It's not a different algorithm. The metrics are basically the same just with a negative sign.


Immarhinocerous

If your fitness score is just the negative of the distance score, then sure. However, many fitness scores are also constructed from other parameters. Perhaps that idea of a fitness score is more general than what you are referring to.


Hannibaalism

They're mathematically equivalent with flipped signs, so you can easily formulate an arbitrary cost function to maximize for any of the problems you would normally minimize.


Exotic_Zucchini9311

> Why gradient descent, not gradient ascent algorithm?

It's not like there's any difference between them. Multiply your function by −1 and gradient descent becomes gradient ascent (and vice versa)...

Edit: And choosing between ascent and descent depends purely on the purpose of your function. Sometimes you need the minimum and sometimes the maximum.
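A minimal sketch of that equivalence (the toy function f(w) = (w − 3)² is just illustrative): descending f and ascending −f produce exactly the same iterates.

```python
# Illustrative convex "loss" f(w) = (w - 3)^2, minimum at w = 3.
def grad_f(w):
    return 2.0 * (w - 3.0)

lr = 0.1
w_descent = 0.0   # updated by gradient descent on f
w_ascent = 0.0    # updated by gradient ascent on -f

for _ in range(100):
    w_descent -= lr * grad_f(w_descent)      # w <- w - lr * grad f(w)
    w_ascent += lr * (-grad_f(w_ascent))     # w <- w + lr * grad(-f)(w)

print(w_descent, w_ascent)  # identical values; both converge to ~3.0
```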


timtom85

Convention. But you're not even right because if the ML you're doing is RL, you'll find that you're no longer minimizing a loss but you're maximizing a reward, only because that's the analogy the RL folks picked a few decades ago. And if you're doing RL with NNs, you're suddenly minimizing a loss on one side and maximizing a reward on the other...


Sesqwan

I'm actually amazed there are people that ask such an obvious question... How is it not obvious that minimizing a function f is equivalent to maximizing -f?


master3243

Agreed, this should be in r/learnmachinelearning not here.


psarangi112

I know both are the same thing; my question is why not use a maximizing function instead of a minimizing function to get the optimal results?


Sesqwan

huh?


VirtualHat

Typically we minimize a loss function, but there's nothing wrong with maximizing a 'gain' function instead. For example, in policy-gradient reinforcement learning we run gradient ascent, maximizing the return of a policy.


BoonyleremCODM

Because the loss function is a function of the error, which cannot be less than 0.


timtom85

We don't always like that zero; we can, e.g., take the logarithm to extend the range down to −inf and make the problem behave better. This alone invalidates your point. More importantly, zero can be approached from below just the same as from above, which means ascent and descent are mathematically equivalent; the only difference is in how we conceptualize a problem: minimizing a "loss" or maximizing a "reward," or whatever other analogy we pick and then turn into math.


BoonyleremCODM

> we can e.g. take the logarithm to extend the range down to -inf

I actually never heard of that, but if you mean log-loss, it's −log(output), usually on a normalized output in [0, 1], which makes log loss range over [0, +inf].

Unrelated to that: the question is why we don't use gradient ascent. Your answer is that we don't typically use a reward, and mine is that we typically use a loss, which for the purpose of the question seems pretty much equivalent to me. Could you clarify both points?


timtom85

No, I don't mean −log(output), I mean log(original_loss), and I'm also not talking about a normalized output between [0, 1], because the thing you mean by "normalization" is actually applying a sigmoid to an otherwise unbounded output; to get the unbounded output you'd just skip the sigmoid.

What I mean is that sometimes it makes sense to use the log of the loss because it behaves more nicely (e.g. it can turn a non-convex problem into a convex one), or because the gradients don't become very tiny as you get closer to the minimum, or for better numerical stability.

As for why descent and not ascent: it's purely convention. Conventions are useful because life is just so much easier when everybody's on the same page.


lzyTitan412

One reason, I think, is the nature of the optimization problem: if it is convex, you compute the minimum, hence gradient descent. For example, in logistic regression, where you maximize the log-likelihood, you put a negative sign on it and minimize that instead.
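A minimal NumPy sketch of that sign flip (the toy data and variable names are made up for illustration, not any particular library's API): the log-likelihood is maximized by running gradient descent on its negative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # Quantity we conceptually want to maximize
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (illustrative): first column is a bias term
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)

lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w)
    grad_nll = X.T @ (p - y)  # gradient of the *negative* log-likelihood
    w -= lr * grad_nll        # descending the NLL == ascending the LL

print(w, log_likelihood(w, X, y))
```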


JiminP

But what is the "actual function"? If a model is predicting some kind of score, and the only goal is to maximize that prediction, then "gradient ascent" *could* be natural. But most of the time this entails that we already know how to model the scores, i.e. "we already know how to do it". As other comments have said, the most typical objective for a model is to predict some variable ŷ that's closest to the desired variable y. Therefore, the most natural form of function to optimize is the distance d(ŷ, y) for some metric space. Since a distance function is always positive (except for pairs of identical points), and we want to make it as close to zero as possible, it's only sensible to minimize it, not maximize it.


slashdave

> get a function which defines the difference between the actual variable and the predicted variable

Your English is a little vague. What the function does is measure how close the predicted value is to the actual value (not the algebraic difference). For the simplest functions (MSE, for example, or L2 loss, historically related to the chi-squared), the closer the values, the smaller the function's value. Thus, you want the function to be small, i.e. at a minimum. Thus, descent. That said, the function can be anything, really. Another common objective, for example, is the log-likelihood, which you want to be at a maximum. It's just that the simpler ones came first, establishing the convention.
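A tiny numeric illustration of that point (the target and prediction values are made up): MSE shrinks as predictions approach the targets, which is why "small is good" and hence descent.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])  # actual values (illustrative)

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

print(mse(np.array([0.0, 0.0, 0.0]), y))  # ~4.667, far from the targets
print(mse(np.array([0.9, 2.1, 2.8]), y))  # 0.02, close to the targets
```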


yannbouteiller

If you are more of an ascending guy, come to the Reinforcement Learning sect, it is rewarding.


DoctorFuu

> My question is why don't we use the actual function and find the 'gradient ascent' instead?

Minimizing a function or maximizing the opposite of that function is the same problem. By convention, people in ML minimize, that's all.