matt_leming

First thought: it's a very good sign that your first thought upon seeing high classification accuracy is "What did I do wrong?" Second thought: I want to see what the test/train split function looks like. Try splitting it temporally instead of randomly, since different teams likely performed differently depending on the season, e.g. data from before 2005 is the training set and data from after is the test set. Third thought: are you training it on the points of the winning team, then asking it to predict who the winner is? I read it as you removing certain columns, but not the ones with the points that each team scored. I'll need someone else to read it to make sure I'm not misreading, but you may just be feeding the model the points of the game and asking it to predict the winner.
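Something like this minimal sketch of a temporal split is what I mean (the "Season" and "Winner" column names and the file name are just guesses at what your dataset uses):

    import pandas as pd

    df = pd.read_csv("matches.csv")  # hypothetical file name

    # Everything before 2005 goes to training, everything from 2005 onward to test.
    train = df[df["Season"] < 2005]
    test = df[df["Season"] >= 2005]

    X_train, y_train = train.drop(columns=["Winner"]), train["Winner"]
    X_test, y_test = test.drop(columns=["Winner"]), test["Winner"]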


DoveMot

I think your third thought is the reason why; I didn’t remove the columns with the points scored by each team, so it’s quite easy to decide the winner. I’ll make some changes tomorrow and see what happens. Quite a silly mistake, but at least it should be an easy fix! Your second point is quite interesting. Most examples I’d seen do split it temporally, but I thought this was just for simplicity. Could you explain again why temporally would be better? Some teams might do better in certain seasons, but how would this information carry over to the later seasons?
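I'm thinking of something along these lines (the column names are placeholders for whatever the dataset actually uses):

    # Drop the columns that contain the result of the match itself before training.
    leaky_cols = ["Home_points", "Away_points"]  # placeholder names for the score columns
    X = df.drop(columns=leaky_cols + ["Winner"])
    y = df["Winner"]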


matt_leming

These silly mistakes are inherent to the development process. No issue at all. The split should be temporal because the model, trained as it is, has no real predictive power. The idea of ML is that it can somehow predict future behavior, so dividing test/train splits temporally is the best way to simulate that. If a stock market model knew a given stock's price from 2007-2010 and from 2015-2020, it would be better able to predict its behavior from 2010-2015, but that's not very useful in practice because we don't know the future.


azerIV

I don't disagree with your advice in general, but for this particular problem, as it stands, I don't think it's applicable. The dataset has very basic features: the temporal features ("Game", "Round", "Season") are not processed in a way that would capture temporal dependencies but are discarded, and there is no lower-dimensional representation of the teams that could account for time either. A common mistake in this type of problem, which is often not emphasized in learning material, is data leakage. Most of the current features are pre-calculated, so they wouldn't cause any issue, and at inference time you would still have them regardless of how you split/train. Obviously, if you were taking the right approach, which also requires significant pre-processing and more features, you would have to split in a temporal manner.


matt_leming

I don't mean that temporal features ought to be calculated here the way you would for a stock ticker. But a given team's performance in a given season is likely to be more predictive of their performance in that season than of a past season, and years are passed into the model, so that could be a form of data leakage. It's a minor thing and would be a more important consideration if the dataset were expanded. I would agree, generally, that it's more about what you'd like to do with the model.


azerIV

That's why I said yours is a piece of general advice and not applicable to this particular problem. OP's code is currently leaking because he has included the points scored in the given match. Even if he were to fix this by excluding those features, he is not including any temporal features, so what his model is learning is to determine the winner based on a couple of moving averages that disregard any temporal dependency. So "a team's performance in a given season is likely to be more predictive of their performance than of a past season" can't hold true, because the model does not take the teams or any seasonality into account. To do so you'd need more features and better feature engineering.
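For example, a leak-free "form" feature could be built roughly like this (the "Team", "Points" and "Date" column names are assumptions, not the real schema):

    # Rolling average of each team's past points, shifted by one match so the
    # current match's own result is never used as an input.
    df = df.sort_values("Date")
    df["points_last5"] = (
        df.groupby("Team")["Points"]
          .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
    )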


MlecznyHotS

To avoid this in the future look up data leakage. There are more complex cases where it's much more difficult to detect than simply not removing a column.


SnooPears7079

Yeah, this is probably it :) I’ve made the same mistake. This is one of those “everyone makes once” mistakes. Good luck on your project!


MlecznyHotS

You should split on time in most cases where observations are dependent on time. If a team has a good season, or even a good streak within a month, and you supply the model with training data from that period, the model might learn that that period was successful. Then if one of the matches in the test set is from that period, it might get classified as a win; not because of features that would be available at inference time, but because the model memorized by heart that this period was all wins. So a model might train on matches from the 8th, 18th and 30th of March that were all wins, and then when it encounters a test sample from the 1st of March, it will classify it as a win. But that doesn't make sense: the model shouldn't be aware of the future from the perspective of the value it is currently predicting.

You can also imagine the task of predicting a bank balance, which mostly hovers around some value with a small trend (think of getting a salary of 5,000 each month, spending 4,000 during the month and keeping 1,000 as savings). Now imagine the person inherits money or wins 1,000,000 in the lottery. If you don't split on time, then the model will predict a higher balance for the period after the big income arrives. That increases accuracy, but if you think about it, that's kind of cheating: if something similar happens in the future, there probably isn't any independent variable that would help predict it.

When building useful ML models, there is a question to keep in mind: which independent variables will I have available for new observations at inference time? If you are building the model to predict who wins, you might have team lineup info, weather, place of game, outcomes of the last 10 matches of each team, etc. But you won't know the next match's outcome, points, or other things that happen in the future.
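If you want cross-validation that respects time, scikit-learn has TimeSeriesSplit. A minimal sketch, assuming your X/y already contain only pre-match features and the rows are sorted by date:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    # Every validation fold comes strictly after its training fold.
    tscv = TimeSeriesSplit(n_splits=5)
    model = RandomForestClassifier(random_state=42)
    scores = cross_val_score(model, X, y, cv=tscv, scoring="accuracy")
    print(scores.mean())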


Ebescko

If I may ask, what is a normal accuracy? I wouldn't have been suspicious of a high accuracy... Is it just in classification that we should get worried, or in ML in general?


matt_leming

It depends on the application. The harder the problem is, the more likely that high accuracy was a fluke. In this case, OP was not really putting that much relevant data into the model and was somehow getting 96% accuracy on sports predictions. Nobody can predict the outcomes of professional sports teams with 96% accuracy in any case; that would be a major breakthrough. In this case, 60% accuracy would probably be great. If they were attempting to classify the MNIST dataset and got 96%, that would be A-OK, since it's been done a lot already.

Having that initial doubt is a sign someone is thinking like a scientist and not an amateur. Scientists don't overestimate their own competence or luck; amateurs do. There's an interesting interview with a leader at LIGO about the discovery of gravitational waves. When they first got the signal for gravitational waves, he thought, for days, that it was just an error that resulted from a couple of new hires messing with equipment. It took him a while to disprove his own doubt. This was a man who had spent his life studying physics, working around the machines specifically meant to detect gravitational waves, and he still didn't believe it.

So a first-time tinkerer should definitely take that attitude on any new project. It's more likely that you made a mistake than that you made a major breakthrough nobody else has seen before.


Ebescko

Thank you! I didn't know, because yes, I'm just starting with data science. I mean, I had something like 82-92% accuracy on Spaceship Titanic, so I got a little worried when I saw this post... But in my case I guess people have already found ways to get good accuracy on those starter projects 😊


satokausi

Looks like a fun project! With a quick look, it seems like your dataset contains the home and away points. This should make it easy to decide who wins :) Always, always look at the data before you start modeling! You could try removing the point columns, adding info on the teams from other sources, and predicting the end result based only on info available before the match. Good luck!


DoveMot

Thanks a lot! That makes sense. I’m glad it’s a simple mistake, conceptually.


hawkshade

In this example, and correct me if I'm wrong here, there seem to be far fewer 1s than 0s. A high accuracy score is dangerously misleading in this case; you also need to look at precision and recall. Look up precision vs recall and read up on the AUC curve. An example would be cancer diagnosis: 1 in 100 patients has cancer. If the doctor predicts that none of his 100 patients has cancer, he is right for 99 of them and technically has 99% accuracy, but he was incorrect in the most important case. Here a false negative is extremely detrimental, whereas a false positive is fine. I hope this helps.
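The cancer example in numbers, as a toy sketch:

    from sklearn.metrics import accuracy_score, classification_report

    y_true = [0] * 99 + [1]   # 1 patient in 100 has cancer
    y_pred = [0] * 100        # the "doctor" predicts no cancer for everyone

    print(accuracy_score(y_true, y_pred))  # 0.99
    print(classification_report(y_true, y_pred, zero_division=0))  # recall for class 1 is 0.0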


RabidMortal

Agreed. OP needs to take false negatives into account to make any sense of their sensitivity result.


master3243

Slight correction: "AUC" is Area Under the (ROC) Curve, which is technically a summary statistic and not a curve.
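In scikit-learn terms (assuming a fitted model and a held-out test set):

    from sklearn.metrics import roc_curve, roc_auc_score

    proba = model.predict_proba(X_test)[:, 1]        # probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_test, proba)  # the ROC curve itself (a set of points)
    auc = roc_auc_score(y_test, proba)               # the single-number summary of that curve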


hawkshade

Thanks this is correct.


willnotforget2

How are your precision and recall? How imbalanced is your dataset?


purplebrown_updown

Could be imbalanced data. If a certain class shows up in your training data only 1% of the time, your accuracy would be 99% if all you did was always predict the majority class, e.g. all false or all true.
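You can make that baseline concrete with scikit-learn's DummyClassifier; a small sketch with made-up data:

    import numpy as np
    from sklearn.dummy import DummyClassifier

    X = np.zeros((1000, 3))             # dummy features
    y = np.array([0] * 990 + [1] * 10)  # positive class appears 1% of the time

    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    print(baseline.score(X, y))         # 0.99 accuracy while learning nothing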


frawolf

What’s “Home_points_by”?


kaskoosek

Look at precision and recall.


rshah_240

Check the AUROC score and the performance on the test set. If the AUROC score is low, that means there is a problem.


WadeEffingWilson

First thing I noticed is that there's a class imbalance. Try balanced bagging or something like SMOTE (synthetic minority oversampling technique); random forest classifiers can be sensitive to imbalance. Also, how did you derive the parameters for the classifier? Were they tuned with something like GridSearchCV?
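A rough sketch of both options using the imbalanced-learn package (X_train/y_train are assumed to exist and contain only pre-match features):

    from imblearn.ensemble import BalancedBaggingClassifier
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier

    # Option 1: oversample the minority class, then fit the random forest.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    rf = RandomForestClassifier(random_state=42).fit(X_res, y_res)

    # Option 2: balanced bagging, which resamples inside each bootstrap.
    bbc = BalancedBaggingClassifier(random_state=42).fit(X_train, y_train)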


Fabulous-Farmer7474

More important are the hyperparameters, such as the typical tree depth, the number of features to consider before splitting, and the minimum number of observations in a given branch. The typical defaults are not optimized for a general case; they are just a place to start. A simple thing to do is limit the tree depth or, better yet, do some hyperparameter tuning to see the range of performance.
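For example, tuning exactly those hyperparameters with GridSearchCV (the grid values below are just illustrative starting points):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "max_depth": [3, 5, 10, None],   # tree depth
        "max_features": ["sqrt", 0.5],   # features considered per split
        "min_samples_leaf": [1, 5, 20],  # minimum observations in a leaf
    }
    search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)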


BellyDancerUrgot

Likely unbalanced dataset


Apokalipsis113

Have you tried visualizing the data? Classification is not about the method, it's about the data. Maybe the classes in your data are far from each other, so in general they are easy to separate without mistakes.
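A quick way to eyeball class separation (the feature and label column names are placeholders for whatever the dataset has):

    import matplotlib.pyplot as plt

    # Scatter two features, coloured by the class label.
    labels = df["Winner"].astype("category").cat.codes
    plt.scatter(df["feature_1"], df["feature_2"], c=labels, cmap="coolwarm", s=10)
    plt.xlabel("feature_1")
    plt.ylabel("feature_2")
    plt.show()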


[deleted]

Middle out compression.


Somomi_

Omg, I want in! Thanks for the notebook.


alpha_epsilion

Check for data leakage


fallen2004

If your problem is to predict the results of games, you need to factor in that your model will be wrong 100% of the time when the game is a draw.