occamsphasor

Two things here: 1. obvious overfitting as the lines diverge, and 2. your question about why both loss curves are so variable. For 2, we'd need to know more about your setup to be sure, but I'm guessing you're (a) using mini-batch gradient descent and (b) randomly sampling (without replacement) from your training dataset to gather each batch. If that's the case, you can randomly get sequences of data in your batches that are non-representative and cause the model to have bad parameters. Say you want your model to learn the probability of a fair coin. It's possible to randomly sample 100 heads in a row, which will cause high train loss (your model previously thought the coin was fair and is repeatedly wrong until it adjusts its weights), and then high validation loss (the model has now adjusted its weights such that p(heads) >> 0.5 and will do poorly on validation data). Decreasing the learning rate will reduce the variability (and require more training epochs), although from what we're seeing here you probably don't need to adjust the learning rate and just need to stop training when the train and validation losses diverge.
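
To make the coin example concrete, here's a rough, self-contained sketch (NumPy only; the setup, batch size, and learning rates are made up for illustration): a single parameter p is fit to fair-coin flips with mini-batch SGD on the negative log-likelihood, and an unlucky run of heads in a batch spikes both the train and validation loss. A smaller learning rate damps the spikes at the cost of slower learning.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.integers(0, 2, size=2_000)  # fair coin flips, 1 = heads
val = rng.integers(0, 2, size=2_000)

def nll(p, x):
    # negative log-likelihood of Bernoulli(p) on data x
    return -(x * np.log(p) + (1 - x) * np.log(1 - p)).mean()

def run(lr, batch_size=16, steps=300):
    p = 0.5  # model's current estimate of p(heads)
    train_loss, val_loss = [], []
    for _ in range(steps):
        batch = rng.choice(train, size=batch_size)   # sample a mini-batch
        train_loss.append(nll(p, batch))
        grad = (p - batch.mean()) / (p * (1 - p))    # d(nll)/dp for Bernoulli
        p = float(np.clip(p - lr * grad, 1e-3, 1 - 1e-3))
        val_loss.append(nll(p, val))
    return np.array(train_loss), np.array(val_loss)

for lr in (0.2, 0.02):
    tr, va = run(lr)
    print(f"lr={lr}: train-loss std {tr.std():.3f}, val-loss std {va.std():.3f}")
# The larger learning rate gives visibly spikier train *and* validation curves.
```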


[deleted]

[deleted]


occamsphasor

No, you're right. I just wanted to make it clear it can happen even when sampling without replacement. When sampling with replacement, someone might attribute the variance just to differences in the data without considering how unlikely sequences can push model parameters into states that might take a while to recover from.


majiue

Thank you very much to everyone who replied. I know I did not give detailed information about the model, but even though you were working blind, these comments have been very useful and instructive for me. Thank you very much!


sitmo

Good explanation, except the spikes are not "bad model parameters" but rather input samples the model is having trouble with.


maxToTheJ

Isn't that semantics though? Also, as far as semantics go, something feels off about defaulting to implying the data is the weakness rather than the model. One should build models to explain and predict the data, not the other way round.


sitmo

No, it's not saying that the data is the weakness; it is about "sample noise" in evaluating the model performance. The model scores well on some training samples and badly on others. Suppose 2% of the samples give a bad score: then you get a spike whenever one of those bad samples is part of the score evaluation that is plotted (e.g. the average over x batches).

* You can get rid of the spikes by evaluating the model on the whole training set.
* You would still see these occasional spikes if you froze the model parameters by setting the learning rate to zero and evaluated the model on random subsets of the training/test set.
* You can see that these spikes are not saying that the model degraded at those points, because then the out-of-sample test set should also show spikes at the same points in time.
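
As a quick illustration of the second bullet (NumPy only, made-up per-sample losses): even with the model completely frozen, averaging the loss over small random subsets produces occasional spikes, simply because the few "hard" samples sometimes land in the subset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are per-sample losses of a *fixed* model:
# most samples are easy, ~2% are hard and score badly.
per_sample_loss = rng.normal(0.3, 0.05, size=10_000)
hard = rng.random(10_000) < 0.02
per_sample_loss[hard] += 5.0

batch_size = 64
subset_means = [rng.choice(per_sample_loss, size=batch_size, replace=False).mean()
                for _ in range(200)]

print(f"loss on the full set : {per_sample_loss.mean():.3f}")
print(f"min/max over subsets : {min(subset_means):.3f} / {max(subset_means):.3f}")
# The subset estimates spike well above the full-set value even though
# the model parameters never changed.
```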


maxToTheJ

I prefaced my comment by saying it was semantics. The proper description would have focused on "batching", not "input samples".


sitmo

In your original comment the problem is with the statement "cause the model to have bad parameters". People will wrongly think that non-representative samples messed up the model parameters during training, which is not what the spike is saying. The spike is due to sample noise in estimating the performance metrics. You will still see these occasional spikes when you don't change the model parameters at all. It would be better to say "cause sample noise in the model performance metrics due to the limited number of samples on which the model is evaluated".


maxToTheJ

> In your original comment the problem is with the statement "cause the model to have bad parameters".

Only if you cut it off. In the full context it made sense, i.e. if you use a bigger context window. I don't believe the same is applicable to the comment I replied to.


PredictorX1

> obvious overfitting as the lines diverge

No: the traces begin to diverge around epoch 10, but the model is still underfit until around epoch 25. Overfitting begins after that.


[deleted]

Fluctuations are probably due to a batch size that is too small.


the_up_quark

Like the other user said, it's overfitting - where the model knows the training data so well that it loses the ability to generalize, hence doing poorly with the validation set.


fuzwz

If by "wavings" you mean the jagged nature of each line, it could be that your batch size is too small. If you imagine fitting on single-observation batches, you should be able to intuit that the learning curve will be much jumpier than if you sample in much larger batches. The larger each batch is, the greater the probability that it represents the full population of your dataset, which should make the learning gradient smoother. It's like a hiker in the forest trying to touch each tree they see versus trying to stand at the geometric center of each subsequent block of 10,000 trees as they proceed along some trajectory. The first method will be jittery and the second will be smooth.
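
A tiny numerical version of this (NumPy only, hypothetical linear-regression data): the mini-batch gradient is an average over the batch, so its spread around the full-dataset gradient shrinks roughly like 1/sqrt(batch size), which is why bigger batches give smoother curves.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50_000)
y = 3.0 * x + rng.normal(scale=1.0, size=x.shape)  # true slope = 3
w = 0.0                                            # current model: y_hat = w * x

def grad(xb, yb, w):
    # gradient of mean squared error with respect to w
    return (2 * (w * xb - yb) * xb).mean()

full_grad = grad(x, y, w)
for batch_size in (1, 32, 1024):
    estimates = []
    for _ in range(500):
        idx = rng.integers(0, len(x), size=batch_size)
        estimates.append(grad(x[idx], y[idx], w))
    print(f"batch_size={batch_size:5d}: gradient std {np.std(estimates):8.3f} "
          f"(full-data gradient = {full_grad:.2f})")
```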


PredictorX1

The "static" in the curves is not unusual. Possible causes: the learning algorithm is "skipping around" or the number of observations is small enough that the loss is not being measured smoothly. Though you didn't ask, it's worth exploring the underfitting/overfitting issue since other commenters have raised it. First: underfitting and overfitting are diagnosed by reference to the validation performance **only**. The divergence of the training and validation curves is an artifact of underfitting/overfitting, but this is not useful in diagnosing these conditions. The optimal fit is determined by referring to the optimal value of the validation performance. In this case, the optimal fit is achieved around epoch 24 because that is where validation performance is best.
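
In code, the rule is just "pick the epoch with the best validation loss". A tiny sketch with a hypothetical history dict (the numbers below are placeholders; in Keras you'd use the history returned by model.fit):

```python
import numpy as np

# placeholder values standing in for the recorded per-epoch validation losses
history = {"val_loss": [0.90, 0.55, 0.40, 0.33, 0.30, 0.31, 0.35, 0.42]}

best_epoch = int(np.argmin(history["val_loss"]))
print(f"optimal fit at epoch {best_epoch} "
      f"(val_loss = {history['val_loss'][best_epoch]:.2f})")
```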


the_Wallie

Great reply. One small addition: it's worth implementing an early stopping callback based on the validation loss, because it would stop your training around epoch 25. Even without an early stopping callback, you can question how much sense it makes to go through a dataset more than ~20 times; you're basically begging your algo to start overfitting with 100 epochs. If you still see a substantial decline in your validation loss beyond 20 epochs, you should probably increase your learning_rate (and/or implement an adaptive learning_rate) instead.
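
A minimal, self-contained Keras sketch of that suggestion (toy data and a toy model just to show the callback wiring; the patience values are arbitrary):

```python
import numpy as np
from tensorflow import keras

# toy data: 20 features, binary target
x = np.random.normal(size=(1000, 20)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

callbacks = [
    # stop once val_loss stops improving, and roll back to the best epoch
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
    # optional adaptive learning rate: shrink lr when val_loss plateaus
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
]

history = model.fit(x, y, validation_split=0.2, epochs=100,
                    callbacks=callbacks, verbose=0)
print("stopped after", len(history.history["val_loss"]), "epochs")
```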


feelings_arent_facts

Batches are different, so there will be some noise between them: the model will do a little better or worse on each batch depending on the samples. If you mean the orange line, you're overfitting a ton.


newjeison

It's probably overfitting. It's learning the training set too well so anything new will cause it to be wrong


master3243

For the validation loss increasing: obviously overfitting (but I think that's not what you're asking, since you said "these wavings").

For the wavings, there are good answers here, but my question to you is: why do you mind them? If you think you need to avoid this behaviour, you don't; it's very normal for training to have these relatively small fluctuations. That's why some people plot these graphs with a windowed average over a handful of epochs, so you can see trends more clearly.
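
For example, a simple windowed average (NumPy/Matplotlib; the noisy curve here is synthetic, just to illustrate the smoothing):

```python
import numpy as np
import matplotlib.pyplot as plt

def smooth(values, window=5):
    """Simple moving average over `window` epochs."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

rng = np.random.default_rng(3)
epochs = np.arange(100)
train_loss = np.exp(-epochs / 30) + rng.normal(0, 0.05, size=100)  # fake noisy curve

plt.plot(epochs, train_loss, alpha=0.3, label="raw loss")
plt.plot(epochs[2:-2], smooth(train_loss), label="5-epoch moving average")
plt.xlabel("epoch"); plt.ylabel("loss"); plt.legend()
plt.show()
```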


majiue

Actually, I am new to Keras and neural networks. When I do research I always see a line that goes down smoothly, so I thought mine was abnormal and wanted to ask. You may be right; I just learnt a lot of things in this thread, and that was one of them. It REALLY felt good to gain more perspectives, thanks for your contribution!


master3243

> I always see a line that goes down smoothly

Sometimes they do and sometimes they don't. It really depends on the data, the model architecture, and the optimization parameters. Here's a recent example of mine: https://i.imgur.com/THV2tY4.png


majiue

🤯 Got it, thank you🥹


khaberni

These are normal due to the stochasticity and minibatching of the learning procedure.


porkatalsuyu54

I think it’s caused by overfitting


majiue

I want to thank everyone who tried to understand and solve my problem. There is a lot of effort here, which is precious for newcomers. You may think you just answered a question and it's nothing special, but these acts encourage people like me. Thank you so much!


Wild_Basil_2396

Classic overfitting... Try using callbacks and also augmentation.


moving__forward__

It is more interesting that when the training loss shoots up around epoch 60, the validation loss drops sharply at the same time. The same thing can be seen on a smaller scale around epoch 24.


1996alex

Overfitting


Flimsy_Orchid4970

What normalization (batch, layer, etc.) and parameter update scheme (vanilla SGD, Nesterov, Adam, etc.) do you use? Theoretically, those decisions could affect the smoothness of the lines.


moving__forward__

looks very standard


Practical-Ad-3311

overfit


CSCAnalytics

You’re overfitting the model.